
Preserve empty lines with NLTK's punkt tokenizer

I am using NLTK's punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the empty lines within the file:

from nltk import data 
tokenizer = data.load('tokenizers/punkt/english.pickle') 
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n" 
sentences = tokenizer.tokenize(s) 
print(sentences) 

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n'] 

But what actually gets printed shows that the trailing blank lines have been stripped from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n'] 

Other tokenizers in NLTK come with a blanklines='keep' argument, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to recover these trailing blank lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!
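For reference, here is the kind of option I mean (a quick sketch of my own, reusing s from above, with LineTokenizer, one of the tokenizers that accepts this argument as far as I know):

from nltk.tokenize import LineTokenizer 

# LineTokenizer can keep blank lines as their own tokens; 
# Punkt exposes no analogous switch. 
print(LineTokenizer(blanklines='keep').tokenize(s)) 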


Regardless of what NLTK does, you can pre-split the text on blank lines (runs of newlines) and then apply NLTK to the resulting chunks. – Vsevolod Dyomkin


@VsevolodDyomkin Interesting idea; in that case, how would one handle sentences that span multiple lines? – duhaime


For that case it just doesn't work :( – Vsevolod Dyomkin

Answers

7

The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting here and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

if match.group('next_tok'):

that is meant to ensure that the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up looking at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, which means the blank lines will never be captured.
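If you want to confirm this yourself, the format string is exposed as a class attribute, so a quick inspection (a small sketch of my own) shows the \s+ in question:

import nltk.tokenize.punkt as pkt 

# Print the default period-context format string; note the \s+ 
# immediately before the (?P<next_tok>...) group. 
print(pkt.PunktLanguageVars._period_context_fmt) 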

The solution

Break it down, change the part you don't like, and reassemble your custom solution.

Now, this regex lives in the PunktLanguageVars class, which itself is used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix this regex the way we want it.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:

_period_context_fmt = r""" 
    \S*       # some word material 
    %(SentEndChars)s    # a potential sentence ending 
    (?=(?P<after_tok> 
     %(NonWord)s    # either other punctuation 
     | 
     \s+(?P<next_tok>\S+)  # or whitespace and some other token 
    ))""" 

to this:

_period_context_fmt = r""" 
    \S*       # some word material 
    %(SentEndChars)s    # a potential sentence ending 
    \s*      # <-- THIS is what I changed 
    (?=(?P<after_tok> 
     %(NonWord)s    # either other punctuation 
     | 
     (?P<next_tok>\S+)  # <-- Normally you would have \s+ here 
    ))""" 

Now a tokenizer using this regex instead of the old one will include 0 or more \s characters after the end of a sentence.

The whole script

import nltk.tokenize.punkt as pkt 

class CustomLanguageVars(pkt.PunktLanguageVars): 

    _period_context_fmt = r""" 
     \S*       # some word material 
     %(SentEndChars)s    # a potential sentence ending 
     \s*      # <-- THIS is what I changed 
     (?=(?P<after_tok> 
      %(NonWord)s    # either other punctuation 
      | 
      (?P<next_tok>\S+)  # <-- Normally you would have \s+ here 
     ))""" 

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars()) 

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n" 

print(custom_tknzr.tokenize(s)) 

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n'] 
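For contrast (my addition, not part of the original answer), running the stock tokenizer on the same string should reproduce the stripped output shown in the question:

default_tknzr = pkt.PunktSentenceTokenizer() 

# Expected, per the question's output: 
# ['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n'] 
print(default_tknzr.tokenize(s)) 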

@duhaime, I changed my solution script to be non-redundant, since we only need to redefine the regex; there is no need to redefine the method that uses it. Cheers! – HugoMailhot


This is absolutely perfect. Your snippet taught me a lot about subclassing in NLTK. Thank you! – duhaime

0

I would go with itertools.groupby, see Python: How to loop through blocks of lines:

$ echo """This is a foo bar sentence, 
that is also a foo bar sentence. 

But I don't like foobars. 
Yes you do like bars with foos, no? 


I'm not sure whether you like bar bar! 
Neither do I like black sheep.""" > test.in 



$ python 
>>> from nltk import sent_tokenize 
>>> import itertools 
>>> with open('test.in', 'r') as fin: 
...  for key, group in itertools.groupby(fin, lambda x: x!='\n'): 
...    if key: 
...      print(list(group)) 
... 
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n'] 
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n'] 
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n'] 

After that, if you want to do a sent_tokenize or apply other punkt models within the group:

>>> with open('test.in', 'r') as fin: 
...  for key, group in itertools.groupby(fin, lambda x: x!='\n'): 
...    if key: 
...      paragraph = " ".join(line.strip() for line in group) 
...      print(sent_tokenize(paragraph)) 
... 
['This is a foo bar sentence, that is also a foo bar sentence.'] 
["But I don't like foobars.", 'Yes you do like bars with foos, no?'] 
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.'] 

(Note: a more computationally efficient way would be to use mmap, see https://stackoverflow.com/a/3915398/610569. But for the size I work with (~20 million tokens), itertools.groupby is sufficient.)
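As a rough illustration of the mmap route mentioned above (my sketch following the linked answer; process_line is a hypothetical per-line handler, not something defined in this thread):

import mmap 

# Memory-map the file and iterate over its lines without reading 
# the whole file into memory at once. 
with open('test.in', 'r+b') as fin: 
    mm = mmap.mmap(fin.fileno(), 0) 
    for line in iter(mm.readline, b""): 
        process_line(line)  # hypothetical handler; lines arrive as bytes 
    mm.close() 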


Thanks @alvas, but your sentence-tokenized output doesn't seem to preserve the line breaks :/ – duhaime


My solution sort of changes the breaks, grouping on blank lines, because in the end I think '\n\n' vs '\n\n\n' should be the same; unless they differ for you, keeping the exact breaks may not be worth the cost =) @HugoMailhot's answer, which tweaks the punkt tokenizer, will be the better solution if the difference between '\n\n' and '[\n]*' matters in your text. – alvas


Thanks @alvas! I'm working with poetry and need to care about displaying the verse correctly, so I need to keep track of all the '\n' characters in the file. Thanks again for the follow-up! – duhaime

1

Split the input into paragraphs, splitting on a capturing regex (so that the captured separator strings are returned as well):

paras = re.split("(\n\s*\n)", s)   # s is the input string from the question 
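On the sample string from the question, the capturing split interleaves the paragraph texts with the whitespace separators (a quick illustration of my own showing what re.split returns here):

import re 

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n" 
print(re.split(r"(\n\s*\n)", s)) 
# ['That was a very loud beep.', '\n\n', 
#  " I don't even know\n if this is working. Mark?", '\n\n', 
#  ' Mark are you there?', '\n\n\n', ''] 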

You can then apply nltk.sent_tokenize() to the individual paragraphs, and either process the results per paragraph or flatten the list, whichever best suits your further use.

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ] 
flat = [ sent for par in sents_by_para for sent in par ] 

(It seems sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check for them and exclude them from processing.)

If you specifically want the whitespace attached to the previous sentence, you can easily paste it back on:

collapsed = [] 
for s in flat: 
    if s.isspace() and len(collapsed) > 0: 
        collapsed[-1] += s 
    else: 
        collapsed.append(s) 
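To make the paste-back step concrete, here is the same loop wrapped as a small function and run on a contrived input (my wrapper, not part of the original answer):

def paste_back(flat): 
    # Append each whitespace-only piece to the sentence before it. 
    collapsed = [] 
    for s in flat: 
        if s.isspace() and collapsed: 
            collapsed[-1] += s 
        else: 
            collapsed.append(s) 
    return collapsed 

print(paste_back(['First sentence.', '\n\n', 'Second sentence.'])) 
# ['First sentence.\n\n', 'Second sentence.'] 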

This is super useful @alexis! Thank you! – duhaime

0

In the end, I combined the insights from @alexis and @HugoMailhot so that I could preserve line breaks in cases where a single paragraph has multiple sentences and/or line breaks:

import re, sys, codecs 
import nltk.tokenize.punkt as pkt 

class CustomLanguageVars(pkt.PunktLanguageVars): 

    _period_context_fmt = r""" 
     \S*       # some word material 
     %(SentEndChars)s    # a potential sentence ending 
     \s*      # <-- THIS is what I changed 
     (?=(?P<after_tok> 
      %(NonWord)s    # either other punctuation 
      | 
      (?P<next_tok>\S+)  # <-- Normally you would have \s+ here 
     ))""" 

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars()) 

def sentence_split(s): 
    '''Read in a string and return a list of sentences with linebreaks intact''' 
    paras = re.split("(\n\s*\n)", s) 
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras] 
    flat = [sent for par in sents_by_para for sent in par] 

    collapsed = [] 
    for s in flat: 
        if s.isspace() and len(collapsed) > 0: 
            collapsed[-1] += s 
        else: 
            collapsed.append(s) 

    return collapsed 

if __name__ == "__main__": 
    s = codecs.open(sys.argv[1], 'r', 'utf-8').read() 
    sentences = sentence_split(s) 
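To eyeball the result with the preserved newlines made visible, one might extend the __main__ block like this (my addition, not part of the original script):

    for sentence in sentences: 
        print(repr(sentence))  # repr() makes the preserved '\n' characters explicit 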