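Preserve empty lines with NLTK's Punkt tokenizer
I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I would like to preserve the empty lines within the file: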
from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)
I would expect this to print:
['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']
However, the actual output shows that the trailing empty lines have been removed from the first and third sentences:
['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']
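Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option for the Punkt tokenizer. It's quite possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!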
Regardless of NLTK, you could pre-split the text on line breaks (multiple newlines) and then run NLTK on the resulting chunks. –
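A minimal sketch of the pre-splitting idea, assuming the split is made only on runs of two or more newlines. The helper name `tokenize_by_paragraph` and its `split_sentences` argument are made up for illustration; `split_sentences` can be any callable that turns a chunk into a list of sentences, such as the `tokenize` method of the Punkt tokenizer loaded in the question.

```python
import re


def tokenize_by_paragraph(text, split_sentences):
    """Split on blank-line runs first, then sentence-split each chunk.

    `split_sentences` is a caller-supplied callable (hypothetical name)
    that maps a chunk of text to a list of sentences.
    """
    # The capturing group makes re.split keep the newline runs in the result.
    pieces = re.split(r"(\n{2,})", text)
    sentences = []
    for piece in pieces:
        if re.fullmatch(r"\n{2,}", piece):
            # Glue the blank-line run onto the sentence that precedes it.
            if sentences:
                sentences[-1] += piece
            else:
                sentences.append(piece)
        elif piece.strip():
            # Leading and trailing spaces of the chunk are dropped here.
            sentences.extend(split_sentences(piece.strip()))
    return sentences
```

Because only blank-line runs are used as split points, a sentence wrapped across single newlines stays in one chunk. A sentence interrupted by a blank line would still be cut in two.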
@VsevolodDyomkin Interesting idea; in that case, how would you handle sentences that are spread across multiple lines? – duhaime
For that case it just doesn't work :( –
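Another possible route, sketched here under assumptions, is to work from character offsets instead of the returned strings. Punkt tokenizers expose `span_tokenize(text)`, which yields (start, end) pairs, so the newlines that follow each sentence can be copied back from the original text. The helper name `keep_trailing_newlines` is made up for illustration, and the rule of keeping the gap only up to its last newline is an assumption chosen to match the output the question asks for.

```python
def keep_trailing_newlines(text, spans):
    """Rebuild sentences from (start, end) offsets, keeping trailing newlines.

    `spans` is any iterable of character offsets into `text`, for example
    the pairs yielded by a Punkt tokenizer's span_tokenize(text).
    """
    spans = list(spans)
    sentences = []
    for i, (start, end) in enumerate(spans):
        # The gap runs from the end of this sentence to the start of the
        # next one, or to the end of the text for the last sentence.
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        gap = text[end:next_start]
        # Keep the gap up to and including its last newline; spaces after
        # that newline, or a gap with no newline at all, are dropped.
        keep = gap.rfind("\n") + 1
        sentences.append(text[start:end] + gap[:keep])
    return sentences
```

With the tokenizer loaded as in the question, the call would be `keep_trailing_newlines(s, tokenizer.span_tokenize(s))`. Since the text is never pre-split, sentences that wrap across lines are left to Punkt as usual.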