
Preserve empty lines with NLTK's punkt tokenizer

I am using NLTK's punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the empty lines within the file:

from nltk import data 
tokenizer = data.load('tokenizers/punkt/english.pickle') 
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n" 
sentences = tokenizer.tokenize(s) 
print(sentences) 

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n'] 

But what actually gets printed shows that the trailing blank lines have been stripped from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n'] 

Other tokenizers in NLTK come with a blanklines='keep' argument, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to recover these trailing blank lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!
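For reference, here is the kind of option I mean (a quick sketch of my own, reusing s from above, with LineTokenizer, one of the tokenizers that accepts this argument as far as I know):

from nltk.tokenize import LineTokenizer 

# LineTokenizer can keep blank lines as their own tokens; 
# Punkt exposes no analogous switch. 
print(LineTokenizer(blanklines='keep').tokenize(s)) 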


Regardless of what NLTK does, you can pre-split the text on blank lines (runs of newlines) and then apply NLTK to the resulting chunks. – Vsevolod Dyomkin


@VsevolodDyomkin Interesting idea; in that case, how would one handle sentences that span multiple lines? – duhaime


For that case it just doesn't work :( – Vsevolod Dyomkin

Answers

7

The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting here and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

if match.group('next_tok'):

that is meant to ensure that the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up looking at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, which means the blank lines will never be captured.
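If you want to confirm this yourself, the format string is exposed as a class attribute, so a quick inspection (a small sketch of my own) shows the \s+ in question:

import nltk.tokenize.punkt as pkt 

# Print the default period-context format string; note the \s+ 
# immediately before the (?P<next_tok>...) group. 
print(pkt.PunktLanguageVars._period_context_fmt) 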

The solution

Break it down, change the part you don't like, and reassemble your custom solution.

Now, this regex lives in the PunktLanguageVars class, which itself is used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix this regex the way we want it.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:

_period_context_fmt = r""" 
    \S*       # some word material 
    %(SentEndChars)s    # a potential sentence ending 
    (?=(?P<after_tok> 
     %(NonWord)s    # either other punctuation 
     | 
     \s+(?P<next_tok>\S+)  # or whitespace and some other token 
    ))""" 

to this:

_period_context_fmt = r""" 
    \S*       # some word material 
    %(SentEndChars)s    # a potential sentence ending 
    \s*      # <-- THIS is what I changed 
    (?=(?P<after_tok> 
     %(NonWord)s    # either other punctuation 
     | 
     (?P<next_tok>\S+)  # <-- Normally you would have \s+ here 
    ))""" 

Now a tokenizer using this regex instead of the old one will include 0 or more \s characters after the end of a sentence.

The whole script

import nltk.tokenize.punkt as pkt 

class CustomLanguageVars(pkt.PunktLanguageVars): 

    _period_context_fmt = r""" 
     \S*       # some word material 
     %(SentEndChars)s    # a potential sentence ending 
     \s*      # <-- THIS is what I changed 
     (?=(?P<after_tok> 
      %(NonWord)s    # either other punctuation 
      | 
      (?P<next_tok>\S+)  # <-- Normally you would have \s+ here 
     ))""" 

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars()) 

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n" 

print(custom_tknzr.tokenize(s)) 

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n'] 
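For contrast (my addition, not part of the original answer), running the stock tokenizer on the same string should reproduce the stripped output shown in the question:

default_tknzr = pkt.PunktSentenceTokenizer() 

# Expected, per the question's output: 
# ['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n'] 
print(default_tknzr.tokenize(s)) 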

@duhaime, I changed my solution script to be non-redundant, since we only need to redefine the regex; there is no need to redefine the method that uses it. Cheers! – HugoMailhot


This is absolutely perfect. Your snippet taught me a lot about subclassing in NLTK. Thank you! – duhaime

0

I would go with itertools.groupby, see Python: How to loop through blocks of lines:

$ echo """This is a foo bar sentence, 
that is also a foo bar sentence. 

But I don't like foobars. 
Yes you do like bars with foos, no? 


I'm not sure whether you like bar bar! 
Neither do I like black sheep.""" > test.in 



$ python 
>>> from nltk import sent_tokenize 
>>> import itertools 
>>> with open('test.in', 'r') as fin: 
...  for key, group in itertools.groupby(fin, lambda x: x!='\n'): 
...    if key: 
...      print(list(group)) 
... 
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n'] 
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n'] 
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n'] 

After that, if you want to do a sent_tokenize or apply other punkt models within the group:

>>> with open('test.in', 'r') as fin: 
...  for key, group in itertools.groupby(fin, lambda x: x!='\n'): 
...    if key: 
...      paragraph = " ".join(line.strip() for line in group) 
...      print(sent_tokenize(paragraph)) 
... 
['This is a foo bar sentence, that is also a foo bar sentence.'] 
["But I don't like foobars.", 'Yes you do like bars with foos, no?'] 
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.'] 

(Note: a more computationally efficient way would be to use mmap, see https://stackoverflow.com/a/3915398/610569. But for the size I work with (~20 million tokens), itertools.groupby is sufficient.)
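As a rough illustration of the mmap route mentioned above (my sketch following the linked answer; process_line is a hypothetical per-line handler, not something defined in this thread):

import mmap 

# Memory-map the file and iterate over its lines without reading 
# the whole file into memory at once. 
with open('test.in', 'r+b') as fin: 
    mm = mmap.mmap(fin.fileno(), 0) 
    for line in iter(mm.readline, b""): 
        process_line(line)  # hypothetical handler; lines arrive as bytes 
    mm.close() 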


Thanks @alvas, but your sentence-tokenized output doesn't seem to preserve the line breaks :/ – duhaime


My solution sort of changes the breaks, grouping on blank lines, because in the end I think '\n\n' vs '\n\n\n' should be the same; unless they differ for you, keeping the exact breaks may not be worth the cost =) @HugoMailhot's answer, which tweaks the punkt tokenizer, will be the better solution if the difference between '\n\n' and '[\n]*' matters in your text. – alvas


Thanks @alvas! I'm working with poetry and need to care about displaying the verse correctly, so I need to keep track of all the '\n' characters in the file. Thanks again for the follow-up! – duhaime

1

Split the input into paragraphs, splitting on a capturing regex (so that the captured separator strings are returned as well):

paras = re.split("(\n\s*\n)", s)   # s is the input string from the question 
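On the sample string from the question, the capturing split interleaves the paragraph texts with the whitespace separators (a quick illustration of my own showing what re.split returns here):

import re 

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n" 
print(re.split(r"(\n\s*\n)", s)) 
# ['That was a very loud beep.', '\n\n', 
#  " I don't even know\n if this is working. Mark?", '\n\n', 
#  ' Mark are you there?', '\n\n\n', ''] 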

You can then apply nltk.sent_tokenize() to the individual paragraphs, and either process the results per paragraph or flatten the list, whichever best suits your further use.

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ] 
flat = [ sent for par in sents_by_para for sent in par ] 

(It seems sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check for them and exclude them from processing.)

If you specifically want the whitespace attached to the previous sentence, you can easily paste it back on:

collapsed = [] 
for s in flat: 
    if s.isspace() and len(collapsed) > 0: 
        collapsed[-1] += s 
    else: 
        collapsed.append(s) 
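To make the paste-back step concrete, here is the same loop wrapped as a small function and run on a contrived input (my wrapper, not part of the original answer):

def paste_back(flat): 
    # Append each whitespace-only piece to the sentence before it. 
    collapsed = [] 
    for s in flat: 
        if s.isspace() and collapsed: 
            collapsed[-1] += s 
        else: 
            collapsed.append(s) 
    return collapsed 

print(paste_back(['First sentence.', '\n\n', 'Second sentence.'])) 
# ['First sentence.\n\n', 'Second sentence.'] 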

This is super useful @alexis! Thank you! – duhaime

0

In the end, I combined the insights from @alexis and @HugoMailhot so that I could preserve line breaks in cases where a single paragraph has multiple sentences and/or line breaks:

import re, sys, codecs 
import nltk.tokenize.punkt as pkt 

class CustomLanguageVars(pkt.PunktLanguageVars): 

    _period_context_fmt = r""" 
     \S*       # some word material 
     %(SentEndChars)s    # a potential sentence ending 
     \s*      # <-- THIS is what I changed 
     (?=(?P<after_tok> 
      %(NonWord)s    # either other punctuation 
      | 
      (?P<next_tok>\S+)  # <-- Normally you would have \s+ here 
     ))""" 

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars()) 

def sentence_split(s): 
    '''Read in a string and return a list of sentences with linebreaks intact''' 
    paras = re.split("(\n\s*\n)", s) 
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras] 
    flat = [sent for par in sents_by_para for sent in par] 

    collapsed = [] 
    for s in flat: 
        if s.isspace() and len(collapsed) > 0: 
            collapsed[-1] += s 
        else: 
            collapsed.append(s) 

    return collapsed 

if __name__ == "__main__": 
    s = codecs.open(sys.argv[1], 'r', 'utf-8').read() 
    sentences = sentence_split(s) 
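To eyeball the result with the preserved newlines made visible, one might extend the __main__ block like this (my addition, not part of the original script):

    for sentence in sentences: 
        print(repr(sentence))  # repr() makes the preserved '\n' characters explicit 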