Python (TextBlob) TF-IDF calculation

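I have looked at several ways of calculating TF-IDF scores for the words in a document using Python. I chose to use TextBlob.

I am getting output, but the values are negative. I understand this to be incorrect (a non-negative quantity (tf) divided by (the log of) a positive quantity (df) does not produce a negative value).

I have looked at the following question posted here: TFIDF calculating confusion, but it did not help.

This is how I am calculating the scores: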

import math

def tf(word, blob):
    return blob.words.count(word)/len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist)/(1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

Then I simply print out the words with their scores.

"hello, this is a test. a test is always good." 


    Top words in document 
    Word: good, TF-IDF: -0.06931 
    Word: this, TF-IDF: -0.06931 
    Word: always, TF-IDF: -0.06931 
    Word: hello, TF-IDF: -0.06931 
    Word: a, TF-IDF: -0.13863 
    Word: is, TF-IDF: -0.13863 
    Word: test, TF-IDF: -0.13863

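With the little knowledge I have, and from what I have seen, could it be that the IDF is being calculated incorrectly?

All help would be appreciated. Thanks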

2015-09-07 user47467

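log(x) is negative if 0 < x < 1 – yurib

@yurib The values can't be negative, though, because the words are present in the document... – user47467

I agree that tf-idf scores shouldn't be negative; I was pointing out that, technically, your implementation can return a negative result. For example, if a word appears in all blobs, idf() will return log(len(bloblist)/(len(bloblist)+1)), which is negative. – yurib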
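To make the arithmetic concrete, here is a minimal sketch that reproduces the negative scores from the question. It uses plain Python 3 lists of words in place of TextBlob objects; the `tokenize` helper is a hypothetical stand-in for `blob.words`, and `idf_question` is just the question's formula under a different name.

```python
import math
import re

def tokenize(text):
    # Assumption: a simple lowercase word split, standing in for blob.words
    return re.findall(r"[a-z']+", text.lower())

def tf(word, words):
    # Fraction of the document's words that are `word`
    return words.count(word) / len(words)

def idf_question(word, docs):
    # Same formula as in the question: +1 in the denominator
    n = sum(1 for words in docs if word in words)
    return math.log(len(docs) / (1 + n))

# A corpus with a single document, as in the question's example
docs = [tokenize("hello, this is a test. a test is always good.")]
words = docs[0]

for word in ["good", "test"]:
    score = tf(word, words) * idf_question(word, docs)
    print(word, round(score, 5))
```

With one document, every word it contains gets idf = log(1/2), so every score is tf multiplied by a negative number.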

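Without an input/output example it is hard to pinpoint the cause, but one likely suspect is the idf() method, which returns a negative value when word appears in every blob. This is because of the +1 in the denominator, which I assume is there to avoid division by zero.
A possible fix is an explicit check for zero: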

import math

def idf(word, bloblist):
    x = n_containing(word, bloblist)
    # Fall back to 1 only when no blob contains the word, to avoid division by zero
    return math.log(len(bloblist)/(x if x else 1))

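Note that in this case a word that appears in exactly one blob and a word that appears in no blob at all will return the same value. You can find another solution that suits your needs; just remember not to take the log of a fraction (a value between 0 and 1).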
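A small sketch of how the zero-check variant behaves at the edges, assuming Python 3. The helper `idf_zero_check` is hypothetical: it takes the two counts directly instead of a word and a blob list, so it can be run without TextBlob.

```python
import math

def idf_zero_check(n_docs, n_with_word):
    # Divide by the number of documents containing the word,
    # falling back to 1 when that number is 0
    return math.log(n_docs / (n_with_word if n_with_word else 1))

# Corpus of 3 documents
print(idf_zero_check(3, 0))  # word in no document
print(idf_zero_check(3, 1))  # word in exactly one document: same value
print(idf_zero_check(3, 3))  # word in every document: log(1)

# Corpus of a single document: every word it contains gets log(1/1)
print(idf_zero_check(1, 1))
```

The last call shows a side effect worth knowing about: with only one document, the idf of every word in it is zero, so every tf-idf score is zero as well.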

2015-09-07 14:17:29 yurib

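I didn't think of that! I have edited the question with an example. – user47467

@user47467 Well, that is exactly the problem I described: you only have one blob, so every word appears in 'all' blobs, and you are taking the log of a fraction.... – yurib

Your method produces a score of 0 for every word :/ – user47467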

-1

IDF scores should be non-negative. The problem is in the implementation of the idf function.

Try this instead:

from __future__ import division, print_function
from textblob import TextBlob
import math

def tf(word, blob):
    return blob.words.count(word)/len(blob.words)

def n_containing(word, bloblist):
    # The +1 here is matched by the +1 on the document count in idf()
    return 1 + sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(float(1+len(bloblist))/float(n_containing(word,bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

text = 'tf–idf, short for term frequency–inverse document frequency'
text2 = 'is a numerical statistic that is intended to reflect how important'
text3 = 'a word is to a document in a collection or corpus'

blob = TextBlob(text)
blob2 = TextBlob(text2)
blob3 = TextBlob(text3)
bloblist = [blob, blob2, blob3]
tf_score = tf('short', blob)
idf_score = idf('short', bloblist)
tfidf_score = tfidf('short', blob, bloblist)
print(tf_score, idf_score, tfidf_score)
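As a quick check that adding 1 to both the document count and the per-word count keeps the result non-negative, here is a minimal Python 3 sketch on plain word lists. The name `idf_smoothed` and the three toy documents are illustrative assumptions, not part of the answer above.

```python
import math

def idf_smoothed(word, docs):
    # log((1 + N) / (1 + n)): since n <= N, the ratio is never below 1
    n = sum(1 for words in docs if word in words)
    return math.log((1 + len(docs)) / (1 + n))

# Three toy documents, already split into words
docs = [
    "a test is always good".split(),
    "this is a test".split(),
    "hello this is good".split(),
]

print(idf_smoothed("is", docs))       # in all 3 documents: log(4/4)
print(idf_smoothed("hello", docs))    # in 1 document: log(4/2)
print(idf_smoothed("missing", docs))  # in no document: log(4/1)
```

A word found in every document scores zero rather than a negative value, and rarer words score higher.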

2015-09-07 14:58:40

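Can you explain tf_score = tf('movie', blob)? – user47467

Sure. Term frequency measures the occurrences of a word in a document. So it basically computes how often the word movie appears in the document named blob. –

Idf, otoh, is used to penalize words that are very common across all documents, such as 'the', 'a', etc. –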
