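python - How to calculate the top 100 words with the highest tf-idf values across different tweets
I have a few dozen tweets saved in a .txt file, and I want to find the 100 words with the highest tf-idf values. In other words, I want to compare the tf-idf values of words across different tweets. At the moment, the only thing I can do is compare the tf-idf values of words within the same tweet; I can't find a way to compare them between different tweets.
Please help me, this problem has been bothering me for a long time. /(ㄒØㄒ)/ ~~
Here is my code (it can only compute the tf-idf values of terms within the same tweet):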
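import csv
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Pattern for the URLs and @mentions that are stripped from each tweet.
URL_OR_MENTION = re.compile(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+|(?:@[\w_]+)')

# Load the stop words once, instead of re-reading the file for every tweet.
with open("D:/Data/ows/stopwords.txt", "r") as f:
    stopwords = set(f.read().split())

# Text mode; UTF-8 is an assumption about the file's encoding.
with open('D:/Data/ows/ows_sample.txt', 'r', encoding='utf-8', errors='ignore') as f:
    tweet = f.readlines()
# Drop NUL characters, which csv.reader cannot handle.
lines = csv.reader((line.replace('\x00', '') for line in tweet), delimiter=',', quotechar='"')
wordterm = []
for i in lines:
    # Column 1 holds the tweet text.
    i[1] = URL_OR_MENTION.sub("", i[1])
    tweets = re.split(r"\W+", i[1])
    tweets = [w.lower() for w in tweets if w != ""]
    terms = [t for t in tweets if t not in stopwords]
    wordterm.append(terms)
word = [' '.join(t) for t in wordterm]
tfidf_vectorizer = TfidfVectorizer(min_df=1, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(word)
# On scikit-learn older than 1.0, use get_feature_names() instead.
terms_name = tfidf_vectorizer.get_feature_names_out()
toarry = tfidf_matrix.todense()
# The code below prints the tf-idf value of every term in each tweet.
for ii in range(0, len(toarry)):
    print("Tweet %d" % ii)
    for jj in range(0, len(terms_name)):
        print(terms_name[jj], '-', tfidf_matrix[ii, jj])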
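One possible way to compare terms across tweets, offered as a rough sketch rather than a definitive solution: give every term a single score, namely its highest tf-idf value in any one tweet, then sort the terms by that score and keep the first 100. The example below uses only the standard library so the idea is easy to follow. The names `tfidf_per_tweet`, `top_terms` and `sample` are hypothetical, and the weighting (smoothed idf followed by L2 normalisation) is an assumption chosen to resemble what I believe `TfidfVectorizer` does by default.

```python
import math
from collections import Counter


def tfidf_per_tweet(docs):
    """Return one {term: tf-idf} dict per tokenized tweet.

    Assumed weighting: tf * (ln((1 + n) / (1 + df)) + 1), then each
    tweet's vector is scaled to unit L2 length.
    """
    n = len(docs)
    # Document frequency: number of tweets that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


def top_terms(docs, n_top=100):
    """Rank terms by their highest tf-idf score in any single tweet."""
    best = {}  # term -> (best score, index of the tweet where it occurs)
    for idx, row in enumerate(tfidf_per_tweet(docs)):
        for term, score in row.items():
            if score > best.get(term, (0.0, -1))[0]:
                best[term] = (score, idx)
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1][0], kv[0]))
    return [(term, score, idx) for term, (score, idx) in ranked[:n_top]]


# Hypothetical, already-cleaned tweets standing in for `wordterm`.
sample = [
    ["occupy", "wall", "street"],
    ["wall", "street", "protest", "protest"],
    ["police", "protest"],
]
for term, score, idx in top_terms(sample, n_top=3):
    print(term, round(score, 3), "tweet", idx)
```

If you would rather stay with scikit-learn, the same per-term maximum should be obtainable from the existing `tfidf_matrix` by taking the column-wise maximum, e.g. `tfidf_matrix.max(axis=0)`, and sorting the result. Whether the maximum, the mean or the sum over tweets is the right way to rank terms depends on what "highest" should mean for your data.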
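This should be a comment, not an answer.
Yeah, but it looks like I need 50 reputation to comment...
Ah, I thought it was 10. Well, better to post some *actual* answers ;) (I flagged it for a mod anyway, so maybe they can convert it to a comment for you.)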