2010-12-21 98 views
0

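Regular expression to parse a tab-delimited file

I'm trying to get this regular expression to capture the fields on a tab-delimited line. It seems to work in every case except when the line starts with two tabs: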

^\t|"(?<field>[^"]+|\t(?=\t))"|(?<field>[^\t]+|\t(?=\t))|\t$ 

For example, where \t represents a tab:

\t \t 123 \t abc \t 345 \t efg 

This captures only 5 fields, omitting one of the leading "blank" (tab) fields.
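As a point of comparison for the expected field count, here is a minimal sketch that splits the sample line on the tab character. `SplitSketch` and `SplitTabs` are hypothetical names used only for illustration, and the sample line is assumed to contain no spaces around the tabs. A plain split does not handle quoted fields, which the regular expression above tries to support.

```csharp
public static class SplitSketch
{
    // Hypothetical helper: splits one line on tabs.
    // String.Split keeps empty entries by default, so each leading tab
    // produces an empty field instead of being dropped.
    public static string[] SplitTabs(string line)
    {
        return line.Split('\t');
    }
}
```

Calling `SplitSketch.SplitTabs("\t\t123\tabc\t345\tefg")` returns six elements: two empty strings followed by `123`, `abc`, `345` and `efg`.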

+1

Why not just use a CSV library configured for tabs? – 2010-12-21 23:48:20

+0

What is the expected behavior? – Restuta 2010-12-22 00:03:49

+0

The expected behavior is to capture six fields. – mike 2010-12-22 15:32:20

Answers

3

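A regular expression is probably not the best tool for this job. I suggest you use the TextFieldParser class, which is designed for parsing files with delimited or fixed-width fields. The fact that it lives in the Microsoft.VisualBasic assembly is a little annoying if you code in C#, but it doesn't prevent you from using it...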
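A minimal sketch of how TextFieldParser might be used for tab-delimited text is shown here. `TabFileSketch` and `ReadRows` are hypothetical names, and reading from a `StringReader` is only an assumption to keep the example self-contained; a file path or stream can be passed to the constructor instead. The project needs a reference to the Microsoft.VisualBasic assembly.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TabFileSketch
{
    // Hypothetical helper: parses tab-delimited text into rows of fields.
    public static List<string[]> ReadRows(string text)
    {
        var rows = new List<string[]>();

        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");
            parser.HasFieldsEnclosedInQuotes = true; // quoted fields may contain tabs
            parser.TrimWhiteSpace = false;           // keep field contents verbatim

            // ReadFields returns one string array per data line.
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }

        return rows;
    }
}
```

With the sample line from the question, `TabFileSketch.ReadRows("\t\t123\tabc\t345\tefg")` should give a single row whose leading empty fields are preserved. Note that the parser skips blank lines, and that `ReadFields` throws a `MalformedLineException` when a line cannot be parsed with the configured format.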

1

Agreed that Regex isn't the right tool for the job here.

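I was cleaning up this code when Thomas posted his link to a nice little gem in the framework. I have used this method to parse delimited text that may contain quoted strings and escape characters. It may not be the most optimized in the world, but in my opinion it is very readable and it gets the job done.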

// Requires: using System.Collections.Generic; using System.Text;

/// <summary>
/// Breaks a string into tokens using a delimiter and a specified text qualifier and escape sequence.
/// </summary>
/// <param name="line">The string to tokenize.</param>
/// <param name="delimiter">The delimiter between tokens, such as a comma.</param>
/// <param name="textQualifier">The text qualifier which enables the delimiter to be embedded in a single token.</param>
/// <param name="escapeSequence">The escape sequence which enables the text qualifier to be embedded in a token.</param>
/// <returns>A collection of string tokens.</returns>
public static IEnumerable<string> Tokenize(string line, char delimiter, char textQualifier = '"', char escapeSequence = '\\')
{
    var inString = false;
    var escapeNext = false;
    var token = new StringBuilder();

    for (int i = 0; i < line.Length; i++) {

        // If the last character was an escape sequence, then it doesn't matter what
        // this character is (field terminator, text qualifier, etc.) because it needs
        // to appear as a part of the field value.
        if (escapeNext) {
            escapeNext = false;
            token.Append(line[i]);
            continue;
        }

        // The escape character itself is not part of the token.
        if (line[i] == escapeSequence) {
            escapeNext = true;
            continue;
        }

        // A text qualifier toggles quoted mode and is not part of the token.
        if (line[i] == textQualifier) {
            inString = !inString;
            continue;
        }

        // Hit the end of the current token?
        if (line[i] == delimiter && !inString) {
            yield return token.ToString();

            // Clear the string builder (instead of allocating a new one).
            token.Remove(0, token.Length);
            continue;
        }

        token.Append(line[i]);
    }

    // Emit the final token (empty if the line ends with a delimiter).
    yield return token.ToString();
}
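To show how the method above might be called on the line from the question, here is a small sketch. `TokenizeSketch` and `ParseLine` are hypothetical names, and the class carries a condensed copy of the Tokenize logic only so that the example compiles on its own; it is an illustration, not a replacement for the method above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TokenizeSketch
{
    // Condensed copy of the Tokenize logic above, included for self-containment.
    public static IEnumerable<string> Tokenize(string line, char delimiter,
        char textQualifier = '"', char escapeSequence = '\\')
    {
        var inString = false;
        var escapeNext = false;
        var token = new StringBuilder();

        foreach (var c in line)
        {
            if (escapeNext) { escapeNext = false; token.Append(c); }
            else if (c == escapeSequence) { escapeNext = true; }
            else if (c == textQualifier) { inString = !inString; }
            else if (c == delimiter && !inString)
            {
                // End of the current token: emit it and reuse the builder.
                yield return token.ToString();
                token.Length = 0;
            }
            else { token.Append(c); }
        }

        // Emit the final token.
        yield return token.ToString();
    }

    // Hypothetical helper: tokenizes one tab-delimited line into a list.
    public static List<string> ParseLine(string line)
    {
        return Tokenize(line, '\t').ToList();
    }
}
```

`TokenizeSketch.ParseLine("\t\t123\tabc\t345\tefg")` yields six tokens, the first two empty, and a quoted field such as `"b<TAB>c"` comes back as one token with the tab embedded. Because the backslash is the default escape character, input that contains literal backslashes (Windows paths, for instance) would need a different `escapeSequence` argument.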