Inverse document frequency for a corpus

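I have a folder containing 10 txt files. I am calculating the IDF of a given term, but my output differs from what I expect. Here is my idf code.

Here s is a set containing the union of all the words from these 10 files.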

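import math
import os

def idf(term):
    # s, tokenizer and root_of_my_corpus are defined elsewhere in the program
    doc_counts = 0
    totaldocs = 10
    if term in s:
        for filename in os.listdir(root_of_my_corpus):
            # read each document and tokenize its lower-cased text
            with open(os.path.join(root_of_my_corpus, filename), "r", encoding='UTF-8') as file:
                idfdoc = file.read().lower()
            tokenidf = tokenizer.tokenize(idfdoc)
            if term in tokenidf:
                doc_counts += 1
    # note: raises ZeroDivisionError if the term was found in no document
    return math.log(totaldocs / doc_counts)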

Source

2016-03-02 Sameer

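Can you provide the output and the expected output, along with some sample data? –

Suppose a term = 'xyz' occurs in 7 documents; my code does not return the exact idf value. – Sameer

That is still not enough information. For example, what else is in your program? Why 'totaldocs = 10' rather than the number of files in 'root_of_my_corpus'? –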
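For reference, the usual definition is idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. With math.log, which is the natural logarithm, a term found in 7 of 10 documents comes out at about 0.357; an expected value computed in base 10 would instead be about 0.155. A minimal sketch of that check follows. The helper name idf_from_counts is made up for this illustration and does not appear in the question.

```python
import math


def idf_from_counts(total_docs, doc_count):
    """Return ln(total_docs / doc_count); hypothetical helper for illustration."""
    if doc_count == 0:
        # The term occurs in no document, so the ratio is undefined.
        raise ValueError("term does not occur in any document")
    return math.log(total_docs / doc_count)


# A term that appears in 7 of 10 documents, natural log versus base 10.
print(round(idf_from_counts(10, 7), 3))
print(round(math.log10(10 / 7), 3))
```

Passing len(os.listdir(root_of_my_corpus)) as total_docs, rather than a fixed 10, would keep the total in step with the folder contents.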

I wrote a small demo of how to calculate idf. The toy data I used consists of four txt files, as follows:

Contents of 1.txt: "Hello world 1"
Contents of 2.txt: "Hello world 2"
Contents of 3.txt: "Hello world 3"
Contents of 4.txt: "Hello world 4"

The code basically loads all the txt contents into a dictionary and then calculates the IDF of each word. Here is the code:

import os
import math


def idf_calc(path):
    # load data: map each file name (without extension) to its first line
    contents = {}
    for item in os.listdir(path):
        file_name = item.split(".")[0]
        raw = ""
        with open(os.path.join(path, item), "r") as fp:
            data = fp.readlines()
            if len(data) > 0:
                raw = data[0].strip()
        contents[file_name] = raw

    # idf calculate
    result = {}
    total_cnt = len(contents)
    # tokenize once so membership is tested on whole words, not substrings
    tokens = {name: set(text.split()) for name, text in contents.items()}
    words = set(word for doc in tokens.values() for word in doc)

    for word in words:
        cnt = sum(1 for doc in tokens.values() if word in doc)
        idf = math.log(total_cnt / cnt)
        result[word] = "%.3f" % idf

    print(result)


idf_calc("../data/txt/")

Result

{'1': '1.386', '3': '1.386', '2': '1.386', '4': '1.386', 'world': '0.000', 'Hello': '0.000'}
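These numbers can be checked without touching the file system. The following is a minimal sketch, assuming the four toy documents are held in a dict and split on whitespace; the names docs and idf_table are illustrative only.

```python
import math
from collections import Counter

# Toy documents from the demo, held in memory instead of read from disk.
docs = {
    "1": "Hello world 1",
    "2": "Hello world 2",
    "3": "Hello world 3",
    "4": "Hello world 4",
}


def idf_table(documents):
    """Map each word to ln(N / df), where df counts documents containing it."""
    doc_freq = Counter()
    for text in documents.values():
        # set() so a word repeated within one document is counted once.
        doc_freq.update(set(text.split()))
    total = len(documents)
    return {word: math.log(total / df) for word, df in doc_freq.items()}


table = idf_table(docs)
print("%.3f" % table["Hello"], "%.3f" % table["1"])
```

"Hello" occurs in all four documents, so its idf is ln(4/4) = 0, while "1" occurs in one document only, giving ln(4/1), about 1.386.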

Hope it helps.

Source

2016-03-02 01:26:09 Eric

The other part of the project is to get the count of a word, but the count differs when I use a large corpus. Code: listwords = [] for filename in os.listdir(corpus_root): file = open(os.path.join(corpus_root, filename), "r", encoding='UTF-8') listdoc = file.read() file.close() listdoc = listdoc.lower() tokensget = tokenizer.tokenize(listdoc) for w in tokensget: if w not in STOP_WORDS: listwords.append(w) #print(sorted(listwords)) def getcount(chkword): countw = 0 for w in listwords: if w == chkword: countw = countw + 1 return countw – Sameer

The comment above is hard to read – Eric

The other part of the project is to get the count of a word, but the count differs when I use a large corpus. Code:

listwords = []
for filename in os.listdir(corpus_root):
    file = open(os.path.join(corpus_root, filename), "r", encoding='UTF-8')
    listdoc = file.read()
    file.close()
    listdoc = listdoc.lower()
    tokensget = tokenizer.tokenize(listdoc)
    for w in tokensget:
        if w not in STOP_WORDS:
            listwords.append(w)

def getcount(chkword):
    countw = 0
    for w in listwords:
        if w == chkword:
            countw = countw + 1
    return countw

– Sameer
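The counting loop in the comment above rescans listwords on every call. Below is a sketch of the same count built once with collections.Counter. It assumes the tokens have already been lower-cased, and the token list and STOP_WORDS set are made-up sample values, not data from the thread. It does not by itself explain why the counts differ on a large corpus.

```python
from collections import Counter

# Made-up sample values standing in for the tokenizer output and stop list.
STOP_WORDS = {"the", "a"}
tokens = ["the", "cat", "sat", "on", "the", "mat", "cat"]

# Count every non-stop word once, up front.
word_counts = Counter(w for w in tokens if w not in STOP_WORDS)


def getcount(chkword):
    """Return how often chkword occurs; Counter gives 0 for unseen words."""
    return word_counts[chkword]


print(getcount("cat"), getcount("the"), getcount("dog"))
```

Stop words and unseen words both come back as 0 here, because neither is ever added to the counter.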
