2016-03-02 87 views

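Finding term frequency and inverse document frequency using NLTK (Python 3.5)

I am trying to use NLTK to run term frequency (TF) and inverse document frequency (IDF) analyses on a batch of files (they happen to be corporate press releases from IBM). I know that whether NLTK has TF-IDF capabilities has been disputed on SO before, but I have found documentation indicating that the module does have them: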

http://www.nltk.org/_modules/nltk/text.html

http://www.nltk.org/api/nltk.html#nltk.text.TextCollection

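I have never seen or used "self" or __init__ in code before. This is what I have so far, and any advice on how to amend it is much appreciated. What I currently have returns nothing. I don't really understand what "source", "self", or "term" and "text" mean in the NLTK documentation.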

import nltk.corpus 
from nltk.text import TextCollection 
from nltk.corpus import gutenberg 
gutenberg.fileids() 

ibm1 = gutenberg.words('ibm-github.txt') 
ibm2 = gutenberg.words('ibm-alior.txt') 

mytexts = TextCollection([ibm1, ibm2]) 
term = 'software' 

def __init__(self, source): 
    if hasattr(source, 'words'): 
     source = [source.words(f) for f in source.fileids()] 

    self._texts = source 
    Text.__init__(self, LazyConcatenation(source)) 
    self._idf_cache = {} 

def tf(self, term, mytexts): 
    result = mytexts.count(term)/len(mytexts) 
    print(result) 

Answer

from nltk.text import TextCollection 
from nltk.book import text1, text2, text3 

mytexts = TextCollection([text1, text2, text3]) 

# Print the IDF of a word 
print(mytexts.idf("Moby")) 

# tf_idf 
print(mytexts.tf_idf("Moby", text1))
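
On the "self" and __init__ confusion: those belong to the class definition inside the library, so you don't copy them into your own script. You build the object once and call its methods. As a rough sketch of what the methods compute, as far as I can tell from the linked module source, here is a toy class in plain Python. The class name TinyTextCollection and the two token lists are made up for illustration; they are not part of NLTK and not your actual press releases.

```python
import math


class TinyTextCollection:
    """Toy stand-in for nltk.text.TextCollection (hypothetical, for illustration)."""

    def __init__(self, texts):
        # __init__ runs automatically when you write TinyTextCollection(...).
        # "self" is the object being created; "texts" is the list you passed in.
        self._texts = texts

    def tf(self, term, text):
        # Term frequency: share of the tokens in one text that equal the term.
        return text.count(term) / len(text)

    def idf(self, term):
        # Inverse document frequency: log(number of texts / texts containing
        # the term), or 0.0 when no text contains the term.
        matches = sum(1 for text in self._texts if term in text)
        return math.log(len(self._texts) / matches) if matches else 0.0

    def tf_idf(self, term, text):
        return self.tf(term, text) * self.idf(term)


# Made-up token lists standing in for two documents.
doc1 = "ibm releases new software on github".split()
doc2 = "ibm and alior bank sign agreement".split()

collection = TinyTextCollection([doc1, doc2])

# Python fills in "self" for you; you only pass the remaining arguments.
print(collection.tf("software", doc1))
print(collection.idf("software"))
print(collection.tf_idf("software", doc1))
print(collection.idf("ibm"))  # term occurs in every text, so log(2/2)
```

Note that tf and tf_idf take the single text you want to score as the second argument, while idf only needs the term because it looks across every text in the collection.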