What is a regular expression for parsing out individual sentences?

I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.

It should be able to parse the following block of text into exactly six sentences:

Hello world! How are you? I am fine. 
This is a difficult sentence because I use I.D. 

Newlines should also be accepted. Numbers should not cause 
sentence breaks, like 1.23.

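This is proving to be a little more challenging than I originally thought.

Any help would be greatly appreciated. I am going to use this to train a system on known bodies of text.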


2009-12-20 Luke Machowski

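@Luke: It looked like you wanted a visible line break between "cause" and "sentence" in your sample text, but it wasn't showing up. I forced it to show by inserting two spaces before the line break. That is how you wanted it to look, isn't it? – 2009-12-20 18:05:50

Yes, you're spot on! Thanks for fixing that. Silly me (still a newbie). – 2009-12-22 20:40:00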

Try this @"(\S.+?[.!?])(?=\s+|$)":

string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. 
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23."; 

Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)"); 
foreach (Match match in rx.Matches(str)) { 
    int i = match.Index; 
    Console.WriteLine(match.Value); 
}

Results:

Hello world! 
How are you? 
I am fine. 
This is a difficult sentence because I use I.D. 
Newlines should also be accepted. 
Numbers should not cause sentence breaks, like 1.23.
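One thing to watch for: in the sample above, the sentence starting with "Numbers" was joined onto a single line. In the question's original text it wraps across a line break, and `.` does not match a newline by default, so the first half of that sentence would be dropped. Below is a minimal sketch of one possible workaround, assuming `RegexOptions.Singleline` suits your input and that collapsing whitespace inside a sentence is acceptable. The variable names are only illustrative, and the snippet is written with C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The question's text, with explicit \n so the line breaks are unambiguous.
string text =
    "Hello world! How are you? I am fine. \n" +
    "This is a difficult sentence because I use I.D. \n\n" +
    "Newlines should also be accepted. Numbers should not cause \n" +
    "sentence breaks, like 1.23.";

// Singleline lets '.' match '\n', so a sentence may span several lines.
var rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

var sentences = new List<string>();
foreach (Match match in rx.Matches(text))
{
    // Collapse any run of whitespace (including the line break) to one space.
    sentences.Add(Regex.Replace(match.Value, @"\s+", " "));
}

foreach (string s in sentences)
{
    Console.WriteLine(s);
}
```

With this input the sketch should yield six sentences, the last one being the complete "Numbers should not cause sentence breaks, like 1.23." rather than only its second line.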

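For complicated cases, of course, you will need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.

Here is the SharpNLP info and feature list:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

a sentence splitter
a tokenizer
a part-of-speech tagger
a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
a parser
a name finder
a coreference tool
an interface to the WordNet lexical database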


2009-12-20 17:20:11 YOU

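+1 for pointing to SharpNLP, which I hadn't seen before and which could be very useful. – 2009-12-20 17:41:35

Better to use a look-ahead assertion for `(?:\s+|$)`. – Gumbo 2009-12-21 08:27:11

Thanks for the info, Gumbo. It is better, but I had to add \S at the front, because leading whitespace has to be stripped on the left. – YOU 2009-12-21 08:58:10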

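This is not really possible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "I.D.", "Mr.", and so on. For example, how many sentences is "Please show your I.D., Mr. Bond."? I'm not familiar with any C# implementations, but I have used NLTK's Punkt tokenizer. It probably shouldn't be too hard to re-implement.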
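To illustrate the "known difficult tokens" case, here is a minimal sketch that splits on whitespace after `.`, `!` or `?` unless the preceding token is in a small abbreviation list. The list, the sample text and the variable names are all hypothetical, and this is not how Punkt works (Punkt learns its abbreviations from data). It is written with C# 9 top-level statements and relies on .NET allowing variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical abbreviation list; a realistic one would be much longer.
string[] abbreviations = { "Mr", "Mrs", "Dr", "I.D" };
string alternatives = string.Join(
    "|", Array.ConvertAll(abbreviations, a => Regex.Escape(a)));

// Split on whitespace that follows . ! or ?, unless the text right before
// the whitespace is one of the known abbreviations followed by a period.
var splitter = new Regex(
    @"(?<=[.!?])(?<!\b(?:" + alternatives + @")\.)\s+");

string text = "Please show your I.D. to Mr. Bond. Then see Dr. No.";
string[] sentences = splitter.Split(text);

foreach (string s in sentences)
{
    Console.WriteLine(s);
}
```

The trade-off is the one described above: a sentence that genuinely ends in a listed abbreviation ("Please show your I.D. Then leave.") will not be split, so the list only moves the problem around rather than solving it.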


2009-12-20 17:23:38

var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. 
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23."; 

Regex.Split(str, @"(?<=[.?!])\s+").Dump();

I tested this in LINQPad (Dump() is LINQPad's output helper).


2009-12-20 17:24:08 SLaks

Thanks for the attempt. – 2009-12-22 20:46:25

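It is impossible to parse natural language with regular expressions. What is the end of a sentence? A period can occur in many places (e.g. in "e.g."). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#, so you may have to create a web service or otherwise link into C#.

Note that you will run into problems later if you rely on exact whitespace, as in "I.D. ". You will soon find examples that break your regex. For example, most people put spaces after their full stops.

There is a good summary of open and commercial offerings on Wikipedia (http://en.wikipedia.org/wiki/Natural_language_processing_toolkits). We have used several of them. It is worth the effort.

[You use the word "train". That is normally associated with machine learning (which is one approach to NLP and has been used for sentence splitting). Indeed, the toolkits I mentioned include machine learning. I suspect that is not what you meant, and that you would instead evolve your expression through heuristics. Don't!]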


2009-12-20 17:29:47

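Thanks for the info. I have always been interested in the machine learning side, and it is an aspect I would like to investigate. For my present purposes, I actually think the simple regex approach will do fine (I don't expect the odd cases you mention). However, I will try the frameworks you speak of, since they already exist. – 2009-12-22 20:45:27

I used the suggestions posted here and came up with a regex that seems to achieve what I want to do!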

(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)

I used Expresso to come up with:

// using System.Text.RegularExpressions; 
/// <summary> 
/// Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM 
/// Using Expresso Version: 3.0.3276, http://www.ultrapico.com 
/// 
/// A description of the regular expression: 
/// 
/// [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)] 
///  \S.+?(?<Terminator>[.!?]|\Z) 
///   Anything other than whitespace 
///   Any character, one or more repetitions, as few as possible 
///   [Terminator]: A named capture group. [[.!?]|\Z] 
///    Select from 2 alternatives 
///     Any character in this class: [.!?] 
///     End of string or before new line at end of string 
/// Match a suffix but exclude it from the capture. [\s+|\Z] 
///  Select from 2 alternatives 
///   Whitespace, one or more repetitions 
///   End of string or before new line at end of string 
/// 
/// 
/// </summary> 
public static Regex regex = new Regex(
     "(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)", 
    RegexOptions.CultureInvariant 
    | RegexOptions.IgnorePatternWhitespace 
    | RegexOptions.Compiled 
    ); 


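// This is the replacement string 
// (it appears to be left over from the Expresso template: the pattern above 
// defines no Day, Month or Year groups, so it is not useful for this regex) 
public static string regexReplace = 
     "$& [${Day}-${Month}-${Year}]"; 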


//// Replace the matched text in the InputText using the replacement pattern 
// string result = regex.Replace(InputText,regexReplace); 

//// Split the InputText wherever the regex matches 
// string[] results = regex.Split(InputText); 

//// Capture the first Match, if any, in the InputText 
// Match m = regex.Match(InputText); 

//// Capture all Matches in the InputText 
// MatchCollection ms = regex.Matches(InputText); 

//// Test to see if there is a match in the InputText 
// bool IsMatch = regex.IsMatch(InputText); 

//// Get the names of all the named and numbered capture groups 
// string[] GroupNames = regex.GetGroupNames(); 

//// Get the numbers of all the named and numbered capture groups 
// int[] GroupNumbers = regex.GetGroupNumbers();
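As a usage sketch for the named groups, the snippet below runs the same pattern over a made-up input whose last sentence has no closing punctuation, which is the case the `\Z` alternative in `Terminator` is there for. The input string and variable names are assumptions for illustration only, and the code uses C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above; only CultureInvariant is kept to stay minimal.
var sentenceRegex = new Regex(
    @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
    RegexOptions.CultureInvariant);

// Made-up input: the second sentence has no closing punctuation.
string input = "Hello world! No terminator here";
MatchCollection matches = sentenceRegex.Matches(input);

foreach (Match m in matches)
{
    string sentence = m.Groups["Sentence"].Value;
    string terminator = m.Groups["Terminator"].Value;
    // An empty terminator means the sentence ran to the end of the input.
    Console.WriteLine("[{0}] terminator: '{1}'", sentence, terminator);
}
```

Here the second match should come back with an empty `Terminator`, which gives a simple way to flag text that stops without a closing mark.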


2009-12-27 13:07:19

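Most people recommend using SharpNLP, and you should go with that, unless you want your QA department to have a bug fest.

But since you are probably under some sort of pressure, here is another attempt that handles words like "Dr." and "X.". However, it will fail on sentences that end with a short word such as "it.".

Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23. See Dr. B or Mr. FooBar for an evaluation of H. pylori.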

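// Requires: using System.Linq; using System.Text.RegularExpressions; 
// The lookbehind rejects a break right after a 1-3 letter word plus a period (the period is escaped so that "?" and "!" still end a sentence). 
var result = new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s[A-Za-z]{1,3}\.)").Split(input).Where(s => !String.IsNullOrWhiteSpace(s)).ToArray<string>(); 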
    foreach (var match in result) 
    { 
     Console.WriteLine(match); 
    }


2016-01-05 19:44:00
