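NLTK sentiment analysis is only returning one value

I really hate to post a question about an entire chunk of code, but I have been working on this for the past 3 hours and I can't wrap my head around what is happening. I am retrieving approximately 600 tweets from a CSV file, each with a score (from -2 to 2) reflecting the sentiment towards a presidential candidate.
However, when I run the classifier trained on this sample against any other data, only one value is ever returned (positive). I have checked that the scores are being added correctly, and they are. It makes no sense to me that all 85,000 tweets would be rated "positive" by a classifier trained on 600 varied tweets. Does anyone know what is happening here? Thanks!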
import nltk
import csv
tweets = []
import ast
with open('romney.csv', 'rb') as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        tweet = row[1]
        try:
            score = ast.literal_eval(row[12])
            if score > 0:
                print score
                print tweet
                tweets.append((tweet,"positive"))
            elif score < 0:
                print score
                print tweet
                tweets.append((tweet,"negative"))
        except ValueError:
            tweet = ""
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

word_features = get_word_features(get_words_in_tweets(tweets))
training_set = nltk.classify.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)
c = 0
with open('usa.csv', "rU") as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        try:
            tweet = row[0]
            c = c + 1
            print classifier.classify(extract_features(tweet.split()))
        except IndexError:
            tweet = ""
What is the type of the 'document' parameter in 'extract_features'? – 2013-02-27 08:09:25
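The type question seems to be the crux. In the posted code each training entry is stored as `(tweet, label)` with `tweet` still a raw string, so `all_words.extend(words)` and `set(document)` iterate over characters, whereas the classification loop passes `tweet.split()`, a list of words. Below is a minimal sketch of that mismatch in plain Python (no NLTK needed). The sample tweet is made up, and `extract_features` takes the vocabulary as an explicit argument purely for illustration:

```python
# Hypothetical sample tweet, for illustration only.
tweet = "romney wins debate"

# What the training path effectively does: extend a list with a string,
# which adds one element per character (including the spaces).
all_chars = []
all_chars.extend(tweet)
train_vocab = set(all_chars)

def extract_features(document, word_features):
    """Same logic as the question's extract_features, vocabulary passed in."""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

# Training path: a raw string goes in, so characters are compared to characters.
train_features = extract_features(tweet, train_vocab)

# Classification path: word tokens go in, but the vocabulary is characters.
test_features = extract_features(tweet.split(), train_vocab)

print(sorted(train_vocab))
print(any(train_features.values()))
print(any(test_features.values()))
```

If this reading is right, nearly every feature is `False` at classification time (only one-character words such as "a" could match), so every tweet presents the classifier with almost the same feature vector, which would explain a single label coming back.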
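Also, I'm not 100% sure about this, but according to the NLTK documentation the proper key name for features in the feature dictionary is 'contains-word(%s)', not 'contains(%s)'. – 2013-02-27 08:18:54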
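As far as I can tell the key string is only a label, so what matters more is that the training and classification paths see the same kind of input. One way to arrange that, sketched here under assumptions, is to tokenize each tweet once at load time so both paths work on word lists. The rows and the `tokenize` helper are hypothetical stand-ins for the CSV data, and NLTK is left out so the snippet runs on its own:

```python
def tokenize(text):
    """Hypothetical helper: lowercase and split on whitespace."""
    return text.lower().split()

# Made-up (tweet, score) rows standing in for the CSV contents.
rows = [("Romney wins debate", 2), ("Romney loses badly", -1), ("Romney speaks", 0)]

# Store token lists rather than raw strings; neutral scores are skipped,
# as in the question's loading loop.
tweets = []
for text, score in rows:
    if score > 0:
        tweets.append((tokenize(text), "positive"))
    elif score < 0:
        tweets.append((tokenize(text), "negative"))

# Vocabulary of whole words, sorted for a stable order.
word_features = sorted({w for words, _ in tweets for w in words})

def extract_features(document):
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

# List of (feature_dict, label) pairs.
training_set = [(extract_features(words), label) for words, label in tweets]

# New input goes through the same tokenizer before feature extraction.
new_features = extract_features(tokenize("Romney wins again"))
print(word_features)
print(new_features['contains(wins)'])
```

With NLTK installed, `training_set` has the `(feature_dict, label)` shape that `nltk.NaiveBayesClassifier.train` accepts, and `new_features` could be passed to `classifier.classify`.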