

I have researched this in sources such as Chae-Deug Park, but none of them discuss using simple sentences as training data.



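What exactly do you mean by a "simple sentence"? Just one sentence rather than a paragraph? In that case your question is about sentence boundary detection. Or a sentence that contains only one subject and predicate (as opposed to a complex sentence with subordinate clauses and so on)? Or something else entirely? – jogojapan 2012-04-11 03:19:49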


Hi jogojapan, yes, that is correct: I just mean a single sentence, not a paragraph... – 2012-04-14 22:39:42


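You have not properly defined what you mean by a simple sentence, so it is hard for anyone to answer your question. Perhaps you want to use something like the Stanford parser to get a parse tree for each sentence and discard every sentence that is not of the form "NP VP", i.e. a noun phrase followed by a verb phrase (e.g. '[John] [sat on the bench]', '[Mary and Jill] [ate their sandwiches]', etc.) – 2012-04-17 07:21:00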
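The "NP VP" filter suggested in the comment above can be sketched without committing to a particular parser. The following is only a rough illustration: it assumes each sentence has already been parsed into a single-line, Penn Treebank style bracketed string (the kind of output a constituency parser can print), and it treats labels that do not start with a letter as punctuation to be ignored. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SimpleSentenceFilter {

    /** Labels of the direct children of the first "(S " node; empty if there is none. */
    static List<String> topLevelLabels(String tree) {
        List<String> labels = new ArrayList<>();
        int sStart = tree.indexOf("(S ");
        if (sStart < 0) {
            return labels;
        }
        int depth = 0;
        for (int i = sStart; i < tree.length(); i++) {
            char c = tree.charAt(i);
            if (c == '(') {
                depth++;
                if (depth == 2) { // a direct child of S opens here
                    int j = i + 1;
                    while (j < tree.length()
                            && !Character.isWhitespace(tree.charAt(j))
                            && tree.charAt(j) != '(' && tree.charAt(j) != ')') {
                        j++;
                    }
                    labels.add(tree.substring(i + 1, j));
                }
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    break; // the S node itself has closed
                }
            }
        }
        return labels;
    }

    /** True if the S node consists of exactly NP followed by VP, ignoring punctuation. */
    static boolean isSimple(String tree) {
        List<String> labels = topLevelLabels(tree);
        // Assumption: labels such as "." or "," mark punctuation and are skipped.
        labels.removeIf(l -> l.isEmpty() || !Character.isLetter(l.charAt(0)));
        return labels.equals(Arrays.asList("NP", "VP"));
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (NP (NNP John)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN bench)))) (. .)))";
        System.out.println(isSimple(tree));
    }
}
```

Note that this check only looks at the top level: a sentence whose VP contains an embedded clause would still pass, so a stricter definition of "simple" would need to inspect the subtree as well.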




import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws IOException {
    // FileNotFoundException and InvalidFormatException are both IOExceptions.
    // try-with-resources closes the model stream once the model is loaded.
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}

    //Failed at Hi.
    String paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println); // not able to break on Noone

    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println); // breaking on dr.

    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

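It fails only where there is human error in the text. For example, the abbreviation "Dr." should be written with a capital D, and there should be at least one space between two sentences.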
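Since the failures are attributed to human error in the input, one option is to normalize the paragraph before handing it to the detector. The sketch below is an assumption about what such a step could look like, not part of the code above: it inserts a space after a period that sits between a lowercase and an uppercase letter, and capitalizes a small, hypothetical list of titles. `ParagraphNormalizer` and its title list are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphNormalizer {

    // Hypothetical title list; extend it for your own data.
    private static final Pattern TITLE =
            Pattern.compile("\\b(dr|mr|mrs|ms)\\.", Pattern.CASE_INSENSITIVE);

    // A period glued between a lowercase and an uppercase letter, e.g. "Door.Noone".
    private static final Pattern GLUED = Pattern.compile("(?<=[a-z])\\.(?=[A-Z])");

    static String normalize(String paragraph) {
        // Step 1: restore the missing space between two sentences.
        String spaced = GLUED.matcher(paragraph).replaceAll(". ");

        // Step 2: capitalize the first letter of each known title.
        Matcher m = TITLE.matcher(spaced);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String title = m.group(1).toLowerCase();
            String fixed = Character.toUpperCase(title.charAt(0)) + title.substring(1) + ".";
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Close the Door.Noone is out there"));
        System.out.println(normalize("Shaik went to meet dr. Kashyap"));
    }
}
```

The lowercase/uppercase rule deliberately leaves "U.S.", "2.2", "Jan.13" and lowercase domain names untouched, but it would also split mixed-case tokens such as some e-mail addresses, so it is a heuristic rather than a safe general fix.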


import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> breakIntoSentencesCustomRESplitter(String paragraph) {
    List<String> sentences = new ArrayList<String>();
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    // Each match of the pattern is one sentence.
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;
}

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at Door.Noone 
    paragraph = "Close the Door.Noone is out there"; 
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at Mr., mrs. 
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson."; 
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at dr. 
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients."; 
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at U.S. 
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code."; 
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence)); 

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]"; 
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence)); 


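import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public static List<String> breakIntoSentencesBreakIterator(String paragraph) {
    List<String> sentences = new ArrayList<String>();
    // Sentence iterator for the default locale.
    BreakIterator sentenceInstance = BreakIterator.getSentenceInstance();
    sentenceInstance.setText(paragraph);

    // Walk the boundaries from the end of the text back to the start.
    int end = sentenceInstance.last();
    for (int start = sentenceInstance.previous();
         start != BreakIterator.DONE;
         end = start, start = sentenceInstance.previous()) {
        // Prepend so the list ends up in reading order.
        sentences.add(0, paragraph.substring(start, end));
    }
    return sentences;
}

    paragraph = "Hi. How are you? This is Mike.";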
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at Door.Noone 
    paragraph = "Close the Door.Noone is out there"; 
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at Mr. 
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson."; 
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence)); 

    //Failed at dr. 
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients."; 
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence)); 

    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code."; 
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence)); 

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]"; 
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence)); 
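Where a splitter breaks after a title such as "Mr." or "dr.", one possible workaround is to post-process its output and glue such fragments back onto the text that follows. The sketch below assumes a small, hypothetical abbreviation list and works on any `List<String>` of fragments, whichever of the three splitters produced it; `AbbreviationMerger` is an illustrative name, not an existing class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AbbreviationMerger {

    // Hypothetical list; compared case-insensitively.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "ms.", "dr."));

    /** Joins a fragment onto the previous one when the previous one ends with an abbreviation. */
    static List<String> merge(List<String> fragments) {
        List<String> merged = new ArrayList<>();
        for (String fragment : fragments) {
            String piece = fragment.trim();
            if (piece.isEmpty()) {
                continue;
            }
            int last = merged.size() - 1;
            if (last >= 0 && endsWithAbbreviation(merged.get(last))) {
                merged.set(last, merged.get(last) + " " + piece);
            } else {
                merged.add(piece);
            }
        }
        return merged;
    }

    private static boolean endsWithAbbreviation(String sentence) {
        // The last whitespace-delimited word, e.g. "mrs." in "... to receive mrs."
        String lastWord = sentence.substring(sentence.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "Really!!", "I cant believe.", "Mr.",
                "Wilson can come any moment to receive mrs.", "watson.");
        merge(fragments).forEach(System.out::println);
    }
}
```

The trade-off is that a sentence which genuinely ends in one of the listed abbreviations would be merged with the next sentence, so the list should stay limited to forms that rarely end a sentence.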


  • Custom RE: 7 ms
  • BreakIterator: 143 ms
  • OpenNLP: 255 ms
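The answer does not say how these timings were taken, so the following is only a minimal sketch of one way to compare splitters, assuming each is exposed as a `Function<String, List<String>>`. `SplitterTimer` and its parameters are hypothetical. Keep in mind that `breakIntoSentencesOpenNlp` as listed above reads the model file on every call, so a timing of that method includes model loading.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class SplitterTimer {

    /** Average wall-clock time per call in microseconds, after some untimed warm-up calls. */
    static long averageMicros(Function<String, List<String>> splitter,
                              String text, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            splitter.apply(text); // let the JIT compile the hot path first
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            splitter.apply(text);
        }
        return (System.nanoTime() - start) / runs / 1_000;
    }

    public static void main(String[] args) {
        // Stand-in splitter for illustration; pass a method reference to a real splitter instead.
        Function<String, List<String>> naive = s -> Arrays.asList(s.split("(?<=[.!?])\\s+"));
        long micros = averageMicros(naive, "Hi. How are you? This is Mike.", 1_000, 10_000);
        System.out.println("naive split: " + micros + " us per call");
    }
}
```

A single cold call, as a figure like 255 ms suggests, mostly measures class loading and start-up cost; averaging over many runs after a warm-up gives a fairer picture of steady-state speed.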

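Take a look at Apache OpenNLP; it has a sentence detector module. The documentation provides examples of how to use it from the command line and through the API.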