2017-03-15

NLTK PlaintextCorpusReader sents and paras functions not working

I can't get the paras and sents functions of PlaintextCorpusReader to work. Here is the code I have:

import nltk 
from nltk.corpus import PlaintextCorpusReader 

corpus_root = './dir_root' 
newcorpus = PlaintextCorpusReader(corpus_root, '.*') # Files you want to add 

word_list = newcorpus.words('file1.txt') 
sentence_list = newcorpus.sents('file1.txt') 
paragraph_list = newcorpus.paras('file1.txt') 

print(word_list) 
print(sentence_list) 
print(paragraph_list) 

word_list comes out fine:

['__________________________________________________________________', 'Title', ...] 

However, printing either paragraph_list or sentence_list fails with this error:

Traceback (most recent call last):
  File "corpus.py", line 13, in <module>
    print(sentence_list)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 225, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 129, in _read_sent_block
    for sent in self._sent_tokenizer.tokenize(para)])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 956, in __getattr__
    self.__load()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 948, in __load
    resource = load(self._path)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 808, in load
    opened_resource = _open(resource_url)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 926, in _open
    return find(path_, path + ['']).open()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 648, in find
    raise LookupError(resource_not_found)
LookupError: 
********************************************************************** 
    Resource 'tokenizers/punkt/PY3/english.pickle' not found. 
    Please use the NLTK Downloader to obtain the resource: >>> 
    nltk.download() 
    Searched in: 
    - '/Users/username/nltk_data' 
    - '/usr/share/nltk_data' 
    - '/usr/local/share/nltk_data' 
    - '/usr/lib/nltk_data' 
    - '/usr/local/lib/nltk_data' 
    - '' 
********************************************************************** 

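I tried using nltk.download() to download the files for the corpus, but that didn't work either. It also doesn't seem like that is how it is supposed to work, since PlaintextCorpusReader already handles words. The sents function is part of PlaintextCorpusReader. Is there a specific field that needs to be supplied? Or is there some kind of regular expression it needs in order to find sentences or paragraphs? The documentation and the source code don't seem to say that it needs anything more than the words function does.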

Answer

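You are missing the data file (the "resource") that the sentence tokenizer needs. Fix the problem by downloading the "punkt" resource, either in the interactive downloader, where it is listed under "Models", or non-interactively by running this code once: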

nltk.download("punkt") 
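
If you would rather not trigger a download every time a script starts, the "run once" step can be wrapped in a check. The following is only a sketch: ensure_resource is a hypothetical helper name, and the resource path and package id are taken from the error message and the call above, so treat both as parameters that may differ on other NLTK releases.

```python
import nltk


def ensure_resource(resource_path, package_id):
    """Return True if the resource is available, downloading it if needed.

    ensure_resource is a hypothetical helper, not part of NLTK itself.
    """
    try:
        # nltk.data.find raises LookupError when nothing matches on the
        # search path (the same error shown in the traceback).
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return nltk.download(package_id, quiet=True)
```

Calling ensure_resource("tokenizers/punkt", "punkt") before building the corpus reader should then be cheap on every run after the first.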

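To avoid running into this kind of problem again and again as you explore NLTK, I recommend downloading the "book" bundle now. It contains everything you are likely to need for a while.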
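
As for why words() worked while sents() and paras() did not: the traceback shows that sentence splitting goes through self._sent_tokenizer, which by default is loaded lazily from the punkt data, whereas the default word tokenizer is a plain regular expression and needs no data file. The sketch below illustrates this by passing an explicit sent_tokenizer, so no download is involved. The regex splitter and the sample text are assumptions made up for the illustration, and the splitter is far cruder than Punkt (it will break on abbreviations such as "Dr."), so it is not a recommended replacement.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for the Punkt model: split on whitespace that
# follows '.', '!' or '?'. gaps=True means the pattern matches separators.
naive_sent_tokenizer = RegexpTokenizer(r"(?<=[.!?])\s+", gaps=True)

with tempfile.TemporaryDirectory() as corpus_root:
    # Made-up sample file: two paragraphs separated by a blank line.
    path = os.path.join(corpus_root, "file1.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("Title here. First sentence!\n\n"
                "Second paragraph starts. It ends here.\n")

    corpus = PlaintextCorpusReader(
        corpus_root, r".*\.txt", sent_tokenizer=naive_sent_tokenizer
    )

    # Materialize the lazy corpus views while the files still exist.
    words = list(corpus.words("file1.txt"))
    sents = [list(s) for s in corpus.sents("file1.txt")]
    paras = [[list(s) for s in p] for p in corpus.paras("file1.txt")]

print(words)
print(sents)
print(paras)
```

With the default reader, paragraphs are the blocks separated by blank lines, and each paragraph is a list of sentences, each of which is a list of word tokens.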