Generating tags from text content

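I am curious whether there is an algorithm or method for generating keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.

I would also be grateful if you could point to any Python-based solution or library for this.

Thanks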

2010-04-18 Hellnar

Do you have training data? – bayer 2010-04-18 17:13:17

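One way to do this is to extract words that occur more frequently in a document than you would expect them to by chance. For example, say that in a larger collection of documents the term "Markov" is almost never seen. However, in a particular document from the same collection, Markov shows up very frequently. This suggests that Markov might be a good keyword or tag to associate with the document.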

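To identify keywords like this, you can use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This roughly tells you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.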

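To identify the 5 best keywords to associate with a document, you just sort the terms by their PMI score with the document and pick the 5 with the highest scores.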
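As a rough illustration of the ranking described above, here is a minimal sketch in plain Python. It assumes the probabilities are estimated from raw token counts over the whole collection; the function name pmi_keywords, the min_count cutoff, and the toy documents are made up for this example and are not part of any library.

```python
import math
from collections import Counter


def pmi_keywords(docs, doc_index, top_n=5, min_count=2):
    """Rank the terms of docs[doc_index] by PMI(term, doc).

    docs is a list of token lists. Probabilities are estimated from raw
    token counts over the whole collection (an assumption of this sketch).
    """
    doc_counts = [Counter(tokens) for tokens in docs]
    total_counts = Counter()
    for counts in doc_counts:
        total_counts.update(counts)
    n_tokens = sum(total_counts.values())  # N: tokens in the collection
    target = doc_counts[doc_index]
    n_doc = sum(target.values())           # tokens in the target document

    scores = {}
    for term, c_td in target.items():
        if c_td < min_count:               # skip terms seen too rarely to trust
            continue
        # P(term, doc) = c_td / N, P(term) = c_t / N, P(doc) = n_doc / N
        scores[term] = math.log(c_td * n_tokens / (total_counts[term] * n_doc))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy collection, invented for illustration only
docs = [
    "the markov chain is a markov process with the markov property".split(),
    "the cat sat on the mat and the dog sat on the rug".split(),
    "the process of cooking is a slow process".split(),
]
keywords = pmi_keywords(docs, doc_index=0, top_n=3)
print(keywords)
```

The min_count cutoff is there because PMI tends to over-reward terms that occur only once; the right threshold depends on how large your collection is.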

If you want to extract multiword tags, see the StackOverflow question How to extract common/significant phrases from a series of text entries.

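Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.: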

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# the genesis corpus must be available locally: nltk.download('genesis')
bigram_measures = BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 5 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 5))

2010-04-18 22:57:28 dmcer

This is great, thank you very much for taking the time! – Hellnar 2010-04-19 15:36:26

I asked a similar question here: http://stackoverflow.com/questions/2764116/tag-generation-from-a-small-text-content-such-as-tweets I wonder whether this algorithm would work on such a small piece of text. – Hellnar 2010-05-04 09:53:09

+1: Wow! Exactly what I was looking for, without even asking :-) – tmow 2011-01-12 21:22:06

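A very simple solution to the problem would be:

count the occurrences of each word in the text
consider the most frequent terms to be the key phrases
keep a blacklist of "stop words" to remove common words such as "the", "and", and so on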

I'm sure there are cleverer, statistics-based solutions, though.
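The three steps above can be sketched in a few lines of standard-library Python. The stop-word list here is a tiny made-up sample, and frequent_terms is a name invented for this example; a real list would be much longer.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "it", "is", "a", "an", "of", "to", "in", "on"}


def frequent_terms(text, top_n=5):
    """Return the top_n most frequent words in text that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


sample = ("The cat chased the mouse, and the mouse hid in the house "
          "of the cat. The cat waited.")
print(frequent_terms(sample, top_n=2))
```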

If you need a solution for a larger project rather than just out of interest, Yahoo BOSS has a key term extraction method.

2010-04-18 09:44:57

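First, the key Python library for computational linguistics is NLTK ("Natural Language Toolkit"). It is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, and so on. I recommend it highly.

Below is a simple template, in Python code, for the problem raised in your question. Although it is a template, it runs: supply any text as a string (as I have done) and it will return a list of word frequencies, as well as a ranked list of those words in order of "importance" (or suitability as keywords) according to a very simple heuristic.

Keywords for a given document are (obviously) chosen from among the important words in the document, that is, the words that are likely to distinguish it from other documents. If you have no prior knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.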

text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """ 

BAD_CHARS = ".!?,\'\""

# transform the text into a list of words: strip punctuation first,
# then filter out short words (so the length test ignores punctuation)
stripped = [raw.strip(BAD_CHARS) for raw in text.split()]
words = [word for word in stripped if len(word) > 4]

word_freq = {}

# generate a 'word histogram' for the text, i.e. a count of each word
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1

# sort the word list by frequency, most frequent first
word_freq_sorted = sorted(word_freq.items(), key=lambda kv: kv[1], reverse=True)

# e.g., what are the most common words in that text?
print(word_freq_sorted[:5])
# obviously using a text larger than 50 or so words will give you more meaningful results


def term_importance(word):
    return 1.0 / word_freq[word]


# select document keywords from the words at/near the top of this list:
importance = {word: term_importance(word) for word in word_freq}
print(sorted(importance.items(), key=lambda kv: kv[1], reverse=True))

2010-04-18 18:19:09 doug

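http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.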

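I used it once to break a corpus of product reviews down into the latent ideas being discussed across all the documents, such as "customer service", "product usability", and so on. The basic model does not prescribe a way to turn a topic into a single word describing what the topic is about... but people have come up with all kinds of heuristics for doing that once their model is trained.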
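One of the simplest such heuristics is to describe a topic by its few most probable words. A minimal sketch, assuming you already have a topic as a word-to-probability mapping; label_topic is a hypothetical helper, and the numbers below are made up for illustration rather than taken from a trained model:

```python
def label_topic(word_probs, top_n=3):
    """Describe a topic by its top_n most probable words."""
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    return ranked[:top_n]


# Made-up distribution for illustration only; not output from a real model.
topic = {"support": 0.21, "refund": 0.17, "agent": 0.12,
         "the": 0.05, "screen": 0.01}
print(label_topic(topic, top_n=3))
```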

I recommend you try playing with http://mallet.cs.umass.edu/ and see whether this model fits your needs.

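LDA is a completely unsupervised algorithm, meaning it does not require you to hand-annotate anything, which is great, but on the flip side it might not deliver the topics you were expecting it to give.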

2010-04-18 22:31:56
