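Importing and using an NLTK corpus

Please, please help. I have a folder full of text files that I want to analyze with NLTK. How do I import it as a corpus and then run NLTK commands on it? Below is the code I put together, but it gives me this error: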

raise error, v # invalid expression 
sre_constants.error: nothing to repeat

Code:

import nltk 
import re 
from nltk.corpus.reader.plaintext import PlaintextCorpusReader 

corpus_root = '/Users/jt/Documents/Python/CRspeeches' 
speeches = PlaintextCorpusReader(corpus_root, '*.txt') 

print "Finished importing corpus" 

words = FreqDist() 

for sentence in speeches.sents(): 
    for word in sentence: 
     words.inc(word.lower()) 

print words["he"] 
print words.freq("he")

2014-09-28 Jolijt Tamanaha

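You're not giving us much to go on. To begin with, **where** do you get the error? Please provide the full error traceback for starters, then step through your program. Does your corpus contain `.txt` files in the `CRspeeches` directory? After initializing `speeches`, do you get a list of your files with `print(speeches.fileids())`? Can you print some of the sentences that `speeches.sents()` should return? – alexis 2014-09-28 22:03:05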

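As I understand it, this problem has to do with a known bug (or maybe it's a feature?), which is partly explained in this answer. In short, regular expressions that try to repeat something empty blow up.

The source of the error is your `speeches =` line. You should change it as follows: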

speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')

Then everything will load and compile just fine.
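The difference between the two patterns can be checked with the `re` module alone, without NLTK. The following is a minimal sketch: the `compiles` helper and the file names are made up for illustration, and the trailing `$` is added here only to force a match against the whole name.

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

glob_style = '*.txt'       # shell-glob syntax: as a regex, '*' has nothing to repeat
regex_style = r'.*\.txt'   # regex syntax: any characters, then a literal ".txt"

print(compiles(glob_style))
print(compiles(regex_style))

# Hypothetical file names, for illustration only.
names = ['speech1.txt', 'notes.md', 'speech2.txt']
matched = [n for n in names if re.match(regex_style + '$', n)]
print(matched)
```

The first pattern fails to compile with the same "nothing to repeat" error shown in the question; the second compiles and selects only the `.txt` names.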

2014-09-28 22:10:16 mixedmath

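Thanks! Perfect solution. – 2014-09-28 23:20:03

When I use it, do I have to keep loading the corpus each time, or can I now just write an import statement at the top of my NLTK scripts? – 2014-09-28 23:20:38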

Spot on, @mixedmath! But it's not a bug: a regular expression that begins with `*` is malformed. (The error message could be more informative, though.) – alexis 2014-10-01 13:32:50
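To illustrate the distinction: `*.txt` is valid as a shell glob but not as a regular expression. One possible way to keep glob syntax is to convert the glob with the standard library's `fnmatch.translate`. This is only a sketch: the file names are hypothetical, and whether the translated pattern behaves as expected when passed to `PlaintextCorpusReader` is an assumption that has not been verified here.

```python
import fnmatch
import re

# fnmatch.translate turns a shell-style glob into an equivalent regex string.
pattern = fnmatch.translate('*.txt')
regex = re.compile(pattern)  # compiles without error, unlike re.compile('*.txt')

# Hypothetical file names, for illustration only.
names = ['speech1.txt', 'notes.md', 'speech2.txt']
txt_files = [name for name in names if regex.match(name)]
print(txt_files)
```

The exact string returned by `fnmatch.translate` varies between Python versions, so it is better to compile it than to compare it against a fixed value.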
