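Searching for some phrases in a text file using Regex in C#

Write a program that counts the phrases in a text file. Any sequence of characters can be given as a phrase to count, even sequences containing separators. For example, in the text "I am a student in Sofia" the phrases "s", "stu", "a" and "I am" are found 2, 1, 3 and 1 times respectively.

I know there are solutions using string.IndexOf, LINQ, or an Aho-Corasick type algorithm. I want to do the same thing with Regex.

This is what I have done so far: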

using System; 
using System.Collections.Generic; 
using System.IO; 
using System.Text.RegularExpressions; 

namespace CountThePhrasesInATextFile 
{ 
    class Program 
    { 
     static void Main(string[] args) 
     { 
      string input = ReadInput("file.txt"); 
      input.ToLower(); 
      List<string> phrases = new List<string>(); 
      using (StreamReader reader = new StreamReader("words.txt")) 
      { 
       string line = reader.ReadLine(); 
       while (line != null) 
       { 
        phrases.Add(line.Trim()); 
        line = reader.ReadLine(); 
       } 
      } 
      foreach (string phrase in phrases) 
      { 
       Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*")); 
       int mathes = regex.Matches(input).Count; 
       Console.WriteLine(phrase + " ----> " + mathes); 
      } 
     } 

     private static string ReadInput(string fileName) 
     { 
      string output; 
      using (StreamReader reader = new StreamReader(fileName)) 
      { 
       output = reader.ReadToEnd(); 
      } 
      return output; 
     } 
    } 
}

I know my regex is incorrect, but I don't know what to change.

Output:

Word ----> 2 
S ----> 2 
MissingWord ----> 0 
DS ----> 2 
aa ----> 0

Correct output:

Word --> 9 
S --> 13 
MissingWord --> 0 
DS --> 2 
aa --> 3

file.txt contains:

Word? We have few words: first word, second word, third word. 
Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD

words.txt contains:

Word 
S 
MissingWord 
DS 
aa

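Source

2016-08-15 Dan

"I know my regex is incorrect": we can't know whether that statement is true until you post your code. I'm 99% sure it's wrong, though – Steve

Please post the contents of your 'file.txt' –

Strings in .NET are immutable, so you need to write 'input = input.ToLower();' –

You need to post the contents of file.txt first; otherwise it is hard to verify whether the regex works.

That said, take a look at the regex answer here: Finding ALL positions of a substring in a large string in C#, and see whether it helps with your code in the meantime.

Edit:

So here is a simple solution: wrap each of your phrases in "(?=(" and "))". This is a lookahead assertion in regex. Because a lookahead consumes no characters, overlapping occurrences are counted too. The code below does what you want.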

 foreach (string phrase in phrases) { 
      string MatchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))"; // Regex.Escape: match the phrase literally
      int mathes = Regex.Matches(input, MatchPhrase).Count; 
      Console.WriteLine(phrase + " ----> " + mathes); 
     }
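
As a rough, self-contained sketch of the same lookahead idea, here is a helper that counts every start position of a phrase. The class name LookaheadCounter and the method name Count are made up for illustration, and RegexOptions.IgnoreCase is used here in place of lowering both strings; treat it as one possible way to package the approach, not as the definitive one.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadCounter
{
    // Counts every position at which `phrase` starts in `input`,
    // including overlapping occurrences. Hypothetical helper for illustration.
    public static int Count(string input, string phrase)
    {
        // An empty phrase would match at every position, so treat it as "no matches".
        if (string.IsNullOrEmpty(phrase)) return 0;

        // Regex.Escape makes metacharacters such as '?' or '.' match literally.
        // The lookahead is zero-width, so each match consumes nothing and the
        // engine retries at the next character.
        string pattern = "(?=" + Regex.Escape(phrase) + ")";
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    static void Main()
    {
        string input = "Word? AAaA";
        Console.WriteLine(Count(input, "aa"));    // 3 (AA, Aa, aA)
        Console.WriteLine(Count(input, "word?")); // 1 ('?' is matched literally)
    }
}
```

Without Regex.Escape, a phrase such as "word?" would be read as "wor" followed by an optional "d", which is not what a literal phrase count wants.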

You also have a problem with

input.ToLower();

This should instead be

input = input.ToLower();
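
To see the difference between the two calls, here is a minimal sketch. The class and method names are invented for illustration only.

```csharp
using System;

static class ImmutableDemo
{
    // Calls ToLower but throws the result away; the returned string is unchanged.
    public static string WithoutAssignment(string s)
    {
        s.ToLower();
        return s;
    }

    // Keeps the new string that ToLower returns.
    public static string WithAssignment(string s)
    {
        s = s.ToLower();
        return s;
    }

    static void Main()
    {
        Console.WriteLine(WithoutAssignment("PASSWORD")); // PASSWORD
        Console.WriteLine(WithAssignment("PASSWORD"));    // password
    }
}
```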

Strings in C# are immutable. Altogether, your code should be:

static void Main(string[] args) { 
     string input = ReadInput("file.txt"); 
     input = input.ToLower(); 
     List<string> phrases = new List<string>(); 
     using (StreamReader reader = new StreamReader("words.txt")) { 
      string line = reader.ReadLine(); 
      while (line != null) { 
       phrases.Add(line.Trim()); 
       line = reader.ReadLine(); 
      } 
     } 
     foreach (string phrase in phrases) { 
      string MatchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))"; // Regex.Escape: match the phrase literally
      int mathes = Regex.Matches(input, MatchPhrase).Count; 
      Console.WriteLine(phrase + " ----> " + mathes); 
     } 
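     System.Threading.Thread.Sleep(50000); // keeps the console window open; fully qualified because System.Threading is not imported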
    } 

    private static string ReadInput(string fileName) { 
     string output; 
     using (StreamReader reader = new StreamReader(fileName)) { 
      output = reader.ReadToEnd(); 
     } 
     return output; 
    }

Source

2016-08-15 15:51:41

I posted the contents of file.txt – Dan

Do you want an exact match, so that 'aa' matches only 'aa' and not 'aaa' or 'aaAaa'? –

In AAaA, aa should match 3 times – Dan

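Here is what is happening. I'll use Word as an example.

The regex you built for "word" is ".*word.*". It tells the regex engine to match anything that starts with anything, contains "word", and ends with anything.

For your input, it matches

Word? We have few words: first word, second word, third word.

which starts with "Word? We have few words: first ", contains "word", and ends with ", second word, third word."

Then it matches the second line, which starts with "Some pass", contains "word", and ends with "s: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD".

So the count is 2: one match per line, because "." does not match a newline by default.

The regex you want is simple: the string "word" by itself is enough.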
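
A small sketch of the difference, using the file.txt contents from the question as an inline string. The class name GreedyDemo and the helper CountMatches are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // Thin wrapper around Regex.Matches, hypothetical and for illustration only.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        string input =
            "Word? We have few words: first word, second word, third word.\n" +
            "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";

        // ".*" swallows the rest of each line, so there is one match per line.
        Console.WriteLine(CountMatches(input, ".*word.*")); // 2

        // The bare phrase matches each occurrence, but only in lower case.
        Console.WriteLine(CountMatches(input, "word"));     // 5

        // With the inline ignore-case flag, all nine occurrences are found.
        Console.WriteLine(CountMatches(input, "(?i)word")); // 9
    }
}
```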

Update:

For case-insensitive matching, try "(?i)word"

And for multiple matches within AAaA, try "(?i)(?<=a)a"

?<= is a zero-width positive lookbehind assertion
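
As a quick check of the two patterns on the AAaA case, here is a minimal sketch. The class name LookbehindDemo and the helper Count are invented for the example. Note that the lookbehind pattern is written by hand for the phrase "aa" and would need to be rebuilt for other phrases.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Hypothetical helper: number of matches of `pattern` in `input`.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        // Plain "aa" consumes two characters per match, so overlaps are skipped.
        Console.WriteLine(Count("AAaA", "(?i)aa"));      // 2

        // The lookbehind consumes only one character per match:
        // every 'a' that is preceded by another 'a' is counted.
        Console.WriteLine(Count("AAaA", "(?i)(?<=a)a")); // 3
    }
}
```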

Source

2016-08-15 15:55:45 Steve

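A learning experience, but it does not work in all cases – Dan

@Dan To make it work, try "(?i)word" – Steve

It works for all cases except aa. It matches 2 times instead of 3. – Dan

Try this code: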

string input = File.ReadAllText("file.txt"); 

foreach (string word in File.ReadLines("words.txt")) 
{ 
    // Trim stray whitespace and escape the phrase so it is matched literally.
    var regex = new Regex(Regex.Escape(word.Trim()), RegexOptions.IgnoreCase); 
    int startat = 0; 
    int count = 0; 

    Match match = regex.Match(input, startat); 
    while (match.Success) 
    { 
     count++; 
     startat = match.Index + 1; 
     match = regex.Match(input, startat); 
    } 

    Console.WriteLine(word + "\t" + count); 
}

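To find all substrings such as "aa" correctly, including overlapping ones, you must use the overload of the Match method that takes a startat parameter.

Note the RegexOptions.IgnoreCase parameter.

Shorter, but less clear code: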

Match match; 
while ((match = regex.Match(input, startat)).Success) 
{ 
    count++; 
    startat = match.Index + 1; 
}
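
The loop can also be wrapped in a reusable method. The following is a sketch under a few assumptions: the class name StartAtCounter and the method name Count are made up, the file contents from the question are inlined as a string instead of being read from disk, and Regex.Escape is added so that phrases are treated as literal text.

```csharp
using System;
using System.Text.RegularExpressions;

static class StartAtCounter
{
    // Counts occurrences of `phrase` in `input`, overlapping ones included,
    // by restarting the search one character after each match start.
    public static int Count(string input, string phrase)
    {
        if (string.IsNullOrEmpty(phrase)) return 0; // nothing sensible to count

        var regex = new Regex(Regex.Escape(phrase), RegexOptions.IgnoreCase);
        int count = 0;

        Match match = regex.Match(input);
        while (match.Success)
        {
            count++;
            // match.Index + 1 never exceeds input.Length because the match is non-empty.
            match = regex.Match(input, match.Index + 1);
        }
        return count;
    }

    static void Main()
    {
        string input =
            "Word? We have few words: first word, second word, third word.\n" +
            "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";

        foreach (string phrase in new[] { "Word", "S", "MissingWord", "DS", "aa" })
        {
            Console.WriteLine(phrase + " --> " + Count(input, phrase));
        }
    }
}
```

Run against the sample input, this should print the counts listed as the correct output in the question: 9, 13, 0, 2 and 3.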

Source

2016-08-15 17:29:54

There is no need to ignore case, because I use the ToLower method – Dan

@Dan - but I didn't know that. ;) –

Can you adapt your code to fit mine? – Dan
