Python - How to find the 100 words with the highest tf-idf values across different tweets

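I have a few dozen tweets saved in a .txt file, and I want to find the 100 words with the highest tf-idf values. In other words, I want to compare the tf-idf values of words across different tweets. So far, the only thing I can do is compare the tf-idf values of words within the same tweet; I can't find a way to compare them across different tweets.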

Please help me; this problem has been bothering me for a long time. /（ㄒØㄒ）/ ~~

Here is my code (it can only compute the tf-idf values of terms within the same tweet):

import csv
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Read the stopword list once rather than once per tweet.
with open('D:/Data/ows/stopwords.txt', 'r') as f:
    stopwords = set(f.read().split())

with open('D:/Data/ows/ows_sample.txt', 'r', newline='') as f:
    tweet = f.readlines()
lines = csv.reader((line.replace('\x00', '') for line in tweet), delimiter=',', quotechar='"')
wordterm = []
for i in lines:
    # Strip URLs and @mentions from the tweet text in column 1.
    i[1] = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+|(?:@[\w_]+)', "", i[1])
    tweets = re.split(r"\W+", i[1])
    tweets = [w.lower() for w in tweets if w != ""]
    terms = [t for t in tweets if t not in stopwords]
    wordterm.append(terms)

word = [' '.join(t) for t in wordterm]
tfidf_vectorizer = TfidfVectorizer(min_df=1, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(word)
terms_name = tfidf_vectorizer.get_feature_names_out()

# Print the tf-idf value of every term, tweet by tweet.
for ii in range(tfidf_matrix.shape[0]):
    print("Tweet %d" % ii)
    for jj in range(len(terms_name)):
        print(terms_name[jj], '-', tfidf_matrix[ii, jj])

Source

2016-07-13 qiang qin

Now that I understand your question, I'll try to give a better answer.

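Getting the top 100 tf-idf scores in a way that is comparable across all tweets means one of two things: either you give up the notion that there are separate tweets, or you compare per-tweet scores and accept that the same word can carry a different tf-idf score in each tweet.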

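For the first case, imagine all your words are in one 'document'. This essentially removes the 'idf' part of tf-idf: with a single document every word has the same idf, so what you get is basically a count vectorizer (scaled by the normalization). The values are comparable with each other, and you can pick the top 100 words that way.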

from sklearn.feature_extraction.text import TfidfVectorizer

# All words placed in a single 'document'.
words = ['the cat sat on the mat cat cat']
tfidf_vectorizer = TfidfVectorizer(min_df=1, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(words)
terms_name = tfidf_vectorizer.get_feature_names_out()  # cat, mat, on, sat, the
toarry = tfidf_matrix.todense()

toarry:
    matrix([[0.75, 0.25, 0.25, 0.25, 0.5]])
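Since the single-document scores are just normalized counts, the same ranking can be sketched without tf-idf at all. The snippet below is only an illustration: the `tweets` list and `top_n` are hypothetical stand-ins, and `collections.Counter` is swapped in for the vectorizer. The regular expression mirrors `TfidfVectorizer`'s default token pattern (words of two or more characters).

```python
import re
from collections import Counter

# Hypothetical stand-ins for the cleaned tweets.
tweets = ['the cat sat on the mat cat', 'the fat rat sat on a mat']

# Pool every tweet into one 'document' and tokenize it.
tokens = re.findall(r'\b\w\w+\b', ' '.join(tweets).lower())

top_n = 3  # would be 100 for the question
top_words = Counter(tokens).most_common(top_n)
print(top_words)
```

With real data, stopwords would need to be removed first, otherwise words like 'the' dominate the list.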

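The other case is that you treat each tweet separately and compare words by their per-tweet tf-idf scores. This leads to the same word having different scores, because that is what tf-idf does: it measures how important a word is in a document relative to the corpus.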

from sklearn.feature_extraction.text import TfidfVectorizer

words = ['the cat sat on the mat cat', 'the fat rat sat on a mat', 'the bat and a rat sat on a mat']
tfidf_vectorizer = TfidfVectorizer(min_df=1, use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(words)
terms_name = tfidf_vectorizer.get_feature_names_out()
# One row per tweet; pair each term with its score in that tweet (rounded for display).
for row in tfidf_matrix.toarray():
    print([(term, round(float(score), 4)) for term, score in zip(terms_name, row)])

[('and', 0.0), ('bat', 0.0), ('cat', 0.788), ('fat', 0.0), ('mat', 0.2327), ('on', 0.2327), ('rat', 0.0), ('sat', 0.2327), ('the', 0.4654)]
[('and', 0.0), ('bat', 0.0), ('cat', 0.0), ('fat', 0.5799), ('mat', 0.3425), ('on', 0.3425), ('rat', 0.441), ('sat', 0.3425), ('the', 0.3425)]
[('and', 0.5017), ('bat', 0.5017), ('cat', 0.0), ('fat', 0.0), ('mat', 0.2963), ('on', 0.2963), ('rat', 0.3815), ('sat', 0.2963), ('the', 0.2963)]
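To see where these numbers come from, the first row can be recomputed by hand. This is a sketch that assumes scikit-learn's documented defaults (smoothed idf, raw term counts, L2 normalization of each row); the helper `smooth_idf` and the dictionaries are names made up for this illustration.

```python
import math

n_docs = 3
# Term counts in the first tweet, 'the cat sat on the mat cat'.
counts = {'cat': 2, 'mat': 1, 'on': 1, 'sat': 1, 'the': 2}
# Number of tweets (out of three) that contain each term.
doc_freq = {'cat': 1, 'mat': 3, 'on': 3, 'sat': 3, 'the': 3}

def smooth_idf(df, n):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Raw tf * idf for each term, then divide by the row's L2 norm.
raw = {t: counts[t] * smooth_idf(doc_freq[t], n_docs) for t in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

print(round(weights['cat'], 4), round(weights['the'], 4))
```

'cat' and 'the' both occur twice in this tweet, but 'cat' appears in only one of the three tweets, so its idf is higher and it ends up with the larger weight.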

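As you can see in the results, the same word gets a different score in each document, because tf-idf is the score of that term within each document. These are the two approaches available, so pick whichever better suits your purpose.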

Source

2016-07-13 19:26:46

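This should be a comment, not an answer.

Yeah, but it looks like I need 50 reputation to comment...

Ah, I thought it was 10. Well, it's best to post an *actual* answer ;) (I flagged it for a mod anyway, so maybe they can convert it to a comment for you.)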
