FreqDist with NLTK

I want to get the frequency distribution of a set of documents using Python. For some reason my code isn't working, and it produces this error:

Traceback (most recent call last): 
    File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module> 
    fd = FreqDist(corpus_text) 
    File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__ 
    self.update(samples) 
    File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update 
    self.inc(sample, count=count) 
    File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc 
    self[sample] = self.get(sample,0) + count 
TypeError: unhashable type: 'list'

Can you help?

Here is the code so far:

import os 
import nltk 
from nltk.probability import FreqDist 


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read() 
stopwords_list = stopwords_doc.split() 
stopwords = nltk.Text(stopwords_list) 

corpus = [] 

#Directory of documents 
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments" 
listing = os.listdir(directory) 

#Append all documents in directory into a single 'document' (list) 
for doc in listing: 
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc 
    input = open(doc_name).read() 
    input = input.split() 
    corpus.append(input) 

#Turn list into Text form for NLTK 
corpus_text = nltk.Text(corpus) 

#Remove stop-words 
for w in corpus_text: 
    if w in stopwords: 
     corpus_text.remove(w) 

fd = FreqDist(corpus_text)

Source

2011-06-08 AJS

The error says you are trying to use a list as a hash key. Can you convert it to a tuple?
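A minimal sketch of what the traceback is pointing at, written for Python 3. It uses a plain dict in place of FreqDist, which is an assumption made for illustration: the traceback shows the old FreqDist storing counts with `self[sample] = self.get(sample,0) + count`, i.e. dict-style key assignment. The sample documents are made up.

```python
# Two hypothetical tokenized documents, shaped like `corpus` in the question:
# a list whose elements are themselves lists of words.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]

counts = {}
error_message = None
try:
    for sample in corpus:
        # Each sample is a whole list of words, and a list cannot be a dict key.
        counts[sample] = counts.get(sample, 0) + 1
except TypeError as err:
    error_message = str(err)  # the message mentions "unhashable type: 'list'"

# Converting each document to a tuple makes it hashable, but the counts
# are then per document, not per word.
doc_counts = {}
for sample in corpus:
    key = tuple(sample)
    doc_counts[key] = doc_counts.get(key, 0) + 1

print(error_message)
print(doc_counts)
```

So a tuple would silence the error, but it would count whole documents. For word frequencies, the words themselves probably need to end up in one flat list.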

Source

2011-06-08 20:56:00 zhizhong

Two thoughts, which I hope contribute at least partly to an answer.

First, the documentation for nltk.text.Text() states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine using a list.
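As a rough sketch of the plain-list approach, written for Python 3: the document strings below are hypothetical stand-ins for the files read from disk. The fallback to `collections.Counter` is an assumption added only so the sketch runs without NLTK installed; in NLTK 3, FreqDist is a subclass of Counter.

```python
from collections import Counter

try:
    from nltk.probability import FreqDist
except ImportError:
    # Assumption for this sketch: without NLTK, fall back to Counter,
    # which FreqDist subclasses in NLTK 3.
    FreqDist = Counter

# Hypothetical file contents standing in for the documents in the directory.
documents = ["the cat sat on the mat", "the dog sat"]

corpus = []
for text in documents:
    # extend() adds each word to the flat list; append() would add the
    # whole word list as one (unhashable) element.
    corpus.extend(text.split())

fd = FreqDist(corpus)
print(fd.most_common(2))
```

The only change from the question's loop is `extend` instead of `append`, so that `corpus` holds strings rather than lists.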

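Second, I would caution you to think about the calculation you are asking NLTK to perform here. Removing stopwords before determining the frequency distribution means your frequencies will be skewed; I don't understand why the stopwords are removed before tabulation rather than simply ignored when examining the distribution afterwards. (I suppose this second point would make a better query/comment than part of an answer, but I felt it was worth pointing out that the proportions would be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in itself.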
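To make the skew concrete, here is a small sketch written for Python 3. The tokens and the stopword set are made up, and `collections.Counter` stands in for FreqDist so that it runs without NLTK.

```python
from collections import Counter

# Hypothetical tokens and stopword list, for illustration only.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
stopwords = {"the", "on"}

# Option 1: drop stopwords first, then count.
filtered = [w for w in tokens if w not in stopwords]
counts_filtered = Counter(filtered)

# Option 2: count everything, and ignore stopwords only when reading results.
counts_all = Counter(tokens)

# The raw count for "cat" is the same either way, but its share of the
# total differs because the totals differ (4 tokens versus 8).
share_filtered = counts_filtered["cat"] / sum(counts_filtered.values())
share_all = counts_all["cat"] / sum(counts_all.values())
print(share_filtered, share_all)
```

Whether 0.5 or 0.25 is the "right" proportion for "cat" depends on what the distribution is used for, which is the point being raised above.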

Source

2011-06-09 06:45:51 dmh

dmh is exactly right. There is no need to use the 'Text()' function in NLTK. Your 'corpus[]' array should be fine for running FreqDist. – 2012-05-20 16:40:55
