How to read text word by word

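I'm working with a txt or htm file. Currently I go through the document char by char using a for loop, but I need to go through the text word by word, and then char by char inside each word. How can I do this?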

for (int i = 0; i < text.Length; i++) 
{}

2013-03-05 Hurrem

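You need a way to delimit the words in your file. Spaces might work, but I can see problems with punctuation and so on. – DGibbs 2013-03-05 17:05:13

Use a regular expression to match a pattern that represents a word. Then search each match character by character. – Alan 2013-03-05 17:06:13

What do you classify as a word? Specifically, when looking at an HTML file? – 2013-03-05 17:06:38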

Use text.Split(' ') to split the text on spaces into an array of words, then iterate over the words.

So:

foreach (String word in text.Split(' '))
    foreach (Char c in word)
        Console.WriteLine(c);
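
One detail worth knowing about Split(' '): consecutive spaces produce empty strings in the result. A minimal sketch of one way around that, assuming the overload of Split that takes StringSplitOptions; the class and method names (SplitDemo, SplitOnSpaces) are made up for illustration:

```csharp
using System;

public static class SplitDemo
{
    // Hypothetical helper: splits on spaces and discards the empty
    // strings that runs of consecutive spaces would otherwise produce.
    public static string[] SplitOnSpaces(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Word by word, then char by char inside each word.
        foreach (string word in SplitOnSpaces("one  two   three"))
        {
            foreach (char c in word)
                Console.Write(c + " ");
            Console.WriteLine();
        }
    }
}
```

Tabs and line breaks are still not treated as separators here; only the space character is.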

2013-03-05 17:07:28 mdubez

You could split on spaces:

string[] words = text.Split(' ');

That gives you an array of words, which you can then iterate over.

foreach (string word in words)
{
    // do something with each word
}

2013-03-05 17:07:44

I think you can use Split:

  var words = reader.ReadToEnd().Split(' ');

or use

foreach (String word in text.Split(' '))
    foreach (Char c in word)
    {
        // process each character c here
    }

2013-03-05 17:07:59

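You can split the string on white space, but you will have to deal with punctuation and HTML markup (you said you are working with txt and htm files).

string[] tokens = text.Split(); // Split() with no arguments splits on white space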
foreach(string tok in tokens) 
{ 
    // process tok string here 
}
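
To illustrate the punctuation problem, here is a rough sketch that splits on white space and then strips leading and trailing punctuation from each token. The names TokenDemo, Words and TrimPunctuation are invented for this example, and treating char.IsPunctuation as the definition of "punctuation" is an assumption; HTML markup is not handled at all.

```csharp
using System;
using System.Collections.Generic;

public static class TokenDemo
{
    // Hypothetical helper: white-space split, then trim punctuation
    // from both ends of each token. Empty results are dropped.
    public static List<string> Words(string text)
    {
        var result = new List<string>();
        // A null separator array makes Split use white space as the delimiter.
        foreach (string tok in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = TrimPunctuation(tok);
            if (word.Length > 0)
                result.Add(word);
        }
        return result;
    }

    static string TrimPunctuation(string tok)
    {
        int start = 0, end = tok.Length;
        while (start < end && char.IsPunctuation(tok[start])) start++;
        while (end > start && char.IsPunctuation(tok[end - 1])) end--;
        return tok.Substring(start, end - start);
    }

    public static void Main()
    {
        foreach (string word in Words("Hello, world! (ok)"))
            Console.WriteLine(word);
    }
}
```

Punctuation inside a token, such as the apostrophe in "don't", is left alone because only the ends are trimmed.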

2013-03-05 17:08:16 toby

A simple approach is to call string.Split with no arguments (which splits on white-space characters):

using (StreamReader sr = new StreamReader(path))
{
    while (sr.Peek() >= 0)
    {
        string line = sr.ReadLine();
        string[] words = line.Split();
        foreach (string word in words)
        {
            foreach (Char c in word)
            {
                // ...
            }
        }
    }
}

I'm using StreamReader.ReadLine to read one whole line at a time.

To parse HTML I would use a robust library such as HtmlAgilityPack.

2013-03-05 17:09:27

You can get all the text from some HTML with HtmlAgilityPack. If you think that is overkill, look here.

HtmlDocument doc = new HtmlDocument(); 
doc.LoadHtml(text); 

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()")) 
{ 
    var nodeText = node.InnerText; 
}

You can then split each node's text content into words, once you have defined what a word is.

Perhaps like this,

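using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static IEnumerable<string> WordsInHtml(string text)
{
    // split on a separator character plus any non-letters around it
    var splitter = new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);

    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
    {
        foreach (var word in splitter.Split(node.InnerText))
        {
            // Regex.Split can yield empty strings at the edges; skip them
            if (word.Length > 0)
                yield return word;
        }
    }
}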

Then, to examine each word:

foreach (var word in WordsInHtml(text))
{
    foreach (var c in word)
    {
        // enumeration by word, then by char
    }
}

2013-03-05 17:20:01 Jodrell

What about a regular expression for the words?

using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication58
{
    class Program
    {
        static void Main()
        {
            string input =
                @"I'm working with a txt or htm file. And currently I'm looking up the document char by char, using for loop, but I need to look up the text word by word, and then inside the word char by char. How can I do this?";
            var list = from Match match in Regex.Matches(input, @"\b\S+\b")
                       select match.Value; // get an IEnumerable of words
            foreach (string s in list)
                Console.WriteLine(s); // do something with each word
            Console.ReadKey();
        }
    }
}

It works with any delimiter, and as far as I know it is the fastest way to do this.
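
Since the question also asks for char by char inside each word, the same idea can be extended with an inner loop. This is only a sketch reusing the \b\S+\b pattern; the class and method names (RegexWordsDemo, Words) are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWordsDemo
{
    // Hypothetical helper: collects every match of \b\S+\b as a word.
    public static List<string> Words(string input)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\b\S+\b"))
            words.Add(match.Value);
        return words;
    }

    public static void Main()
    {
        foreach (string word in Words("Hello, world!"))
        {
            Console.WriteLine(word);
            foreach (char c in word)
                Console.WriteLine("  " + c); // char by char inside the word
        }
    }
}
```

Because \b requires a word character on one side, punctuation at the start or end of a token is left out of the match.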

2013-03-05 17:39:50 Psilon

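Here is my lazy implementation, written as an extension to StreamReader. The idea is to avoid loading the whole file into memory, especially if your file is one single long line.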

public static string ReadWord(this StreamReader stream)
{
    string word = "";
    // read a single character at a time, building a word
    // until reaching white space or end of stream (-1)
    while (stream.Read()
        .With(c => { // with each character . . .
            // StreamReader.Read() already returns a decoded character
            // code, so a cast is enough; no Encoding is needed here
            if (c == -1 || Char.IsWhiteSpace((char)c))
                return -1; // signal end of word

            word = word + (char)c; // append the char to our word
            return c;
        }) > -1); // the loop ends once the lambda returns -1
    return word;
}

public static T With<T>(this T obj, Func<T,T> f)
{
    return f(obj);
}

Simply use it like this:

using (var s = File.OpenText(file))
{
    while (!s.EndOfStream)
        s.ReadWord().ToCharArray().DoSomething(); // DoSomething is a placeholder for your own processing
}

2014-02-15 01:06:02
