Finding document frequency with Python

Hey everyone, I know this has been asked a few times already, but I'm having a hard time finding document frequency with Python. I'm trying to compute TF-IDF and then the cosine score between each document and a query, but I'm stuck on finding the document frequency. This is what I have so far:

#imports
import re 
import os 
import operator 
import glob 
import sys 
import math 
from collections import Counter 

#number of command line argument checker 
if len(sys.argv) != 3: 
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt' 
    sys.exit(1) 

#Read in the directory to the files
path = sys.argv[1]

#Read in the query 
y = sys.argv[2] 
querystart = re.findall(r'\w+', open(y).read().lower()) 
query = [Z for Z in querystart] 
Query_vec = Counter(query) 
print Query_vec 

#counts total number of documents in the directory 
doccounter = len(glob.glob1(path,"*.txt")) 

if os.path.exists(path) and os.path.isfile(y): 
    word_TF = [] 
    word_IDF = {} 
    TFvec = [] 
    IDFvec = [] 

    #this is my attempt at finding IDF 
    for filename in glob.glob(os.path.join(path, '*.txt')): 

        words_IDF = re.findall(r'\w+', open(filename).read().lower())

        #keeps only alphabetic words at least 3 characters long
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]

        word_IDF = doc_IDF

        #pseudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] += 1
            else:
                word_idf[key] = 1

        print word_IDF
        """

    #goes to that directory and reads in the files there 
    for filename in glob.glob(os.path.join(path, '*.txt')): 

        words_TF = re.findall(r'\w+', open(filename).read().lower())

        #scans each document for words greater than or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        #this assigns values to each term; this is my TF for each vector
        TFvec = Counter(doc_TF)

        #weighting the TF with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])

    #placed here so I don't get a command line full of text
    print TFvec

#Error checker 
else: 
    print "That path does not exist" 

I'm using Python 2, and so far I really have no idea how to count how many documents a term appears in. I can find the total number of documents, but I'm really struggling to find the number of documents a given term occurs in. My plan is just to build one big dictionary that holds every term across all the documents, which can then be pulled from later whenever a query needs those terms. Thanks for any help you can give me.


Is there a reason you're trying to implement this yourself rather than using a library: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html –


I read about it, but I have to log the tf and idf values, and I figured it would be easier if I implemented it myself. Also, I'll be reading in a directory of around 100 text files, so again I thought that would be easier than using scikit – Sean


Also, I'll have to do the cosine for tfidf later on. Does scikit have that as well? – Sean
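For reference, scikit-learn does ship cosine similarity in sklearn.metrics.pairwise, so the library route covers both steps. A minimal sketch under that assumption; the variables docs and q below are made-up stand-ins for the file contents and the query text:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first document text", "second document text"]  # stand-ins for the ~100 .txt files
q = "query text"                                        # stand-in for the contents of query.txt

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)  # one TF-IDF row per document
query_vec = vectorizer.transform([q])        # reuse the same vocabulary for the query

print cosine_similarity(query_vec, doc_matrix)  # cosine score of the query against each document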

Answer


The DF of a term x is the number of documents that x appears in. To find it, you need to iterate over all of the documents first. Only then can you compute the IDF from the DF, typically as IDF(x) = log(N / DF(x)), where N is the total number of documents.

You can use a dictionary for counting the DF:

  1. Iterate over all of the documents.
  2. For each document, retrieve the set of its words (without repetitions).
  3. Increase the DF count of every word from stage 2. Thus you increase the count by exactly one, no matter how many times the word appeared in the document.

The Python code could look something like this:

from collections import defaultdict 
import math 

DF = defaultdict(int) 
for filename in glob.glob(os.path.join(path, '*.txt')): 
    words = re.findall(r'\w+', open(filename).read().lower()) 
    for word in set(words): 
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1 # defaultdict simplifies your "if key in word_idf: ..." part.

# Now you can compute IDF. 
IDF = dict() 
for word in DF: 
    IDF[word] = math.log(doccounter/float(DF[word])) # Don't forget that python2 uses integer division. 
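From here, one way to get the cosine score the question is after is to weight both the document's TFvec and the query's Query_vec by IDF and normalize. A rough, untested sketch that assumes the TFvec, Query_vec, and IDF variables built above (cosine_score is a made-up helper name, not a library function):

def cosine_score(query_vec, tf_vec, idf):
    # dot product over the terms the query and the document share, in TF-IDF space
    dot = sum(query_vec[t] * idf[t] * tf_vec[t] * idf[t]
              for t in query_vec if t in tf_vec and t in idf)
    # Euclidean norms of the two TF-IDF vectors
    doc_norm = math.sqrt(sum((tf_vec[t] * idf.get(t, 0.0)) ** 2 for t in tf_vec))
    query_norm = math.sqrt(sum((query_vec[t] * idf.get(t, 0.0)) ** 2 for t in query_vec))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)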

P.S. It's good to learn to implement things by hand, but if you get stuck, I suggest you take a look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
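For example, NLTK's PlaintextCorpusReader can read the same folder of .txt files and hand back tokenized words per file. A small sketch, assuming nltk is installed and path is the directory from the question:

from nltk.corpus.reader import PlaintextCorpusReader

corpus = PlaintextCorpusReader(path, r'.*\.txt')
for fileid in corpus.fileids():
    # lowercased alphabetic tokens, ready to feed into the DF/TF code above
    words = [w.lower() for w in corpus.words(fileid) if w.isalpha()]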


Thank you so much. Someone recommended defaultdict to me yesterday, but I didn't know how to use it. – Sean