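Extracting the most important keywords from a set of documents

I have a set of 3000 text documents, and I want to extract the top 300 keywords (these can be single words or multi-word phrases).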

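I have tried the following approaches:

RAKE: a Python-based keyword extraction library. It got me nowhere.

Tf-Idf: it gave me good keywords for each document, but I am not able to aggregate them to find keywords that represent the whole set of documents. Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?

Word2vec: I was able to do some cool things like finding similar words, but I am not sure how to use it to find important keywords.

Can you please suggest a good approach (or explain how to improve any of the 3 above) to solve this problem? Thanks :)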

2017-08-24 Vijender

It is better for you to pick those 300 words manually (it is not that many, and it is a one-time task). The code below is written in Python 3:

import os

top_words = ["word1", "word2"]  # ... etc.: your hand-picked keywords
found = set()  # hand-picked keywords seen in at least one file

for name in os.listdir():
    if not os.path.isfile(name):
        continue  # skip subdirectories
    with open(name, "r", errors="ignore") as f:
        tokens = set(f.read().split())
    for word in top_words:
        # Compare against whole tokens, not whole lines
        if word in tokens and word not in found:
            print("I found %s" % word)
            found.add(word)
    # Stop once 300 keywords have been found
    if len(found) >= 300:
        break

2017-08-24 12:21:41 durduliu2009

import os
from collections import Counter

words = Counter()  # word -> total occurrences across all files
for name in os.listdir():
    if not os.path.isfile(name):
        continue  # skip subdirectories
    with open(name, "r", errors="ignore") as open_file:
        for line in open_file:
            words.update(line.split())
# Sort by count, most frequent first
sorted_words = sorted(words.items(), key=lambda kv: kv[1], reverse=True)

Now take the top 300 from the sorted words; they are the words you want.

2017-08-24 13:13:42

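Thanks @Awaish, but I have tried this too. The results of this approach are poor, because the important terms occur only once or twice. And if I try to sort and select Tf-idf terms based on frequency, many common and irrelevant terms show up. – Vijender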

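The simplest and most effective approach is to apply a TF-IDF implementation to find the most important words. If you have a stop word list, you can filter out the stop words before applying this code. Hope this works for you.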

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0; // to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : all the terms of all the documents
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0; // number of documents that contain termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
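
For readers following the Python snippets earlier in the thread, here is a rough Python sketch of the same two formulas as the Java class above: tf as the share of tokens in a document equal to the term, and idf as 1 + ln(N / number of documents containing the term). The function names `tf` and `idf` and the toy corpus are made up for illustration and are not part of the answer.

```python
import math

def tf(doc_terms, term):
    """Term frequency: share of tokens in one document equal to `term` (case-insensitive)."""
    term = term.lower()
    return sum(1 for t in doc_terms if t.lower() == term) / len(doc_terms)

def idf(all_docs, term):
    """Inverse document frequency: 1 + ln(N / number of documents containing `term`)."""
    term = term.lower()
    doc_count = sum(1 for doc in all_docs if any(t.lower() == term for t in doc))
    # Raises ZeroDivisionError if the term occurs in no document
    return 1 + math.log(len(all_docs) / doc_count)

# Hypothetical toy corpus, already tokenized
docs = [
    ["lucene", "indexes", "text"],
    ["text", "mining", "with", "python"],
    ["python", "scripts"],
]

# tf-idf of "lucene" in the first document
score = tf(docs[0], "lucene") * idf(docs, "lucene")
print(round(score, 4))
```

Here "lucene" makes up one of three tokens in the first document and appears in one of three documents, so its score is (1/3) * (1 + ln 3).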

2017-08-25 18:00:41 shiv

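Thanks @shiv. But I have already implemented Tf-Idf, and I used Lucene to do it (for faster processing). The problem is that Tf-Idf gives the "important terms" for each document, not for the whole set of documents. – Vijender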
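
One possible way to get from per-document scores to keywords for the whole set, sketched here under assumptions and not proposed by anyone in the thread, is to combine each term's per-document tf-idf scores into a single corpus-level score and rank by that. The function `corpus_keywords`, its `stop_words` and `combine` parameters, and the toy corpus are all hypothetical; the tf and idf formulas follow the ones used in the answer above.

```python
import math
from collections import Counter, defaultdict

def corpus_keywords(docs, k, stop_words=frozenset(), combine=sum):
    """Return the k terms with the highest combined tf-idf over all documents."""
    # Drop stop words, then drop documents that end up empty
    docs = [[t for t in doc if t not in stop_words] for doc in docs]
    docs = [doc for doc in docs if doc]

    # Document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Collect each term's per-document tf-idf scores
    scores = defaultdict(list)
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = 1 + math.log(len(docs) / df[term])
            scores[term].append(tf * idf)

    # Combine the per-document scores into one corpus-level score per term
    ranked = sorted(scores, key=lambda term: combine(scores[term]), reverse=True)
    return ranked[:k]

# Hypothetical toy corpus, already tokenized and lower-cased
docs = [
    ["the", "lucene", "index"],
    ["the", "lucene", "query"],
    ["the", "python", "script"],
]
print(corpus_keywords(docs, 2, stop_words={"the"}))
```

For the case in the question, `k` would be 300. Summing favors terms that score well in many documents, so without a stop word list a word like "the" can still come out on top in this toy corpus. Passing `combine=max` instead ranks a term by its best single-document score, which may be a better fit when important terms occur only once or twice.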
