2011-06-08

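FreqDist with NLTK

I'm trying to get the frequency distribution of a set of documents using Python. For some reason my code isn't working, and it produces this error: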

Traceback (most recent call last): 
    File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module> 
    fd = FreqDist(corpus_text) 
    File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__ 
    self.update(samples) 
    File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update 
    self.inc(sample, count=count) 
    File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc 
    self[sample] = self.get(sample,0) + count 
TypeError: unhashable type: 'list' 

Can you help?

Here is the code so far:

import os 
import nltk 
from nltk.probability import FreqDist 


#The stop-words list 
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read() 
stopwords_list = stopwords_doc.split() 
stopwords = nltk.Text(stopwords_list) 

corpus = [] 

#Directory of documents 
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments" 
listing = os.listdir(directory) 

#Append all documents in directory into a single 'document' (list) 
for doc in listing: 
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc 
    input = open(doc_name).read() 
    input = input.split() 
    corpus.append(input) 

#Turn list into Text form for NLTK 
corpus_text = nltk.Text(corpus) 

#Remove stop-words 
for w in corpus_text: 
    if w in stopwords: 
     corpus_text.remove(w) 

fd = FreqDist(corpus_text) 

Answers

1

The error says you are trying to use a list as a hash key. Can you convert it to a tuple?
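To make the error concrete, here is a minimal sketch. It uses collections.Counter as a stand-in for FreqDist (in NLTK 3, FreqDist subclasses Counter; the traceback above comes from an older release, so treat the equivalence as an assumption), and the two sample documents are invented. It suggests that a tuple conversion would silence the error but count whole documents, while flattening the list counts words.

```python
from collections import Counter

# Hypothetical stand-ins for the token lists read from two files.
doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat".split()

# append() adds each document as ONE element, giving a list of lists.
nested = []
for tokens in (doc_a, doc_b):
    nested.append(tokens)

try:
    Counter(nested)  # each sample is a list, which cannot be a dict key
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Converting each inner list to a tuple makes it hashable, but the
# counts are then per document, not per word.
per_document = Counter(tuple(tokens) for tokens in nested)

# extend() adds the individual words, so each sample is a hashable string.
flat = []
for tokens in (doc_a, doc_b):
    flat.extend(tokens)
per_word = Counter(flat)

print(error_message)
print(per_word.most_common(2))
```

If per-word frequencies are the goal, the extend() variant is probably the one you want.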

2

Two thoughts that I hope at least contribute to an answer.

First, the documentation for nltk.text.Text() states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine using a plain list.
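As a rough sketch of the plain-list approach: the helper name read_tokens and the throwaway demo directory are made up for illustration, and Counter again stands in for FreqDist so the snippet runs without NLTK installed.

```python
import os
import tempfile
from collections import Counter  # stand-in for nltk.probability.FreqDist


def read_tokens(directory):
    """Return one flat list of whitespace-separated tokens from every file."""
    tokens = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path) as handle:
            tokens.extend(handle.read().split())  # extend, not append
    return tokens


# Demo with a temporary directory instead of the asker's real path.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in [("a.txt", "the cat sat"), ("b.txt", "the dog")]:
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write(text)
    corpus = read_tokens(demo_dir)

fd = Counter(corpus)  # with NLTK available, FreqDist(corpus) should work the same way
print(fd.most_common(1))
```

The list goes straight into the counter; no Text() wrapper is involved.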

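Second, I would caution you to think about the computation you are asking NLTK to perform here. Removing stop words before computing the frequency distribution means your frequencies will be skewed; I don't see why the stop words are removed before tabulation rather than simply ignored when you examine the distribution afterwards. (I suppose this second point would make a better query/comment than part of an answer, but I felt it was worth pointing out that the proportions will be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in itself.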
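A small sketch of the difference, with an invented sentence and stop-word set, and Counter standing in for FreqDist. The raw counts of the remaining words come out the same either way; what changes is each word's share of the total.

```python
from collections import Counter

tokens = "the cat and the dog and the bird".split()
stopwords = {"the", "and"}  # hypothetical stop-word set

# Option 1: remove stop words first, then count.
filtered = [w for w in tokens if w not in stopwords]
fd_filtered = Counter(filtered)

# Option 2: count everything, skip stop words only when reading results.
fd_all = Counter(tokens)
content_counts = {w: c for w, c in fd_all.items() if w not in stopwords}

# Relative frequency of "cat" under each option.
share_filtered = fd_filtered["cat"] / len(filtered)
share_all = fd_all["cat"] / len(tokens)

print(share_filtered, share_all)
```

Note that the filtering here builds a new list with a comprehension; removing items from a list while looping over that same list, as the question's code does, can skip elements.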


dmh is exactly right. There is no need to use the 'text()' function in NLTK. Your 'corpus[]' array should be fine for running FreqDist. – 2012-05-20 16:40:55