How to extract phrases from a corpus using gensim

To preprocess a corpus, I want to extract its common phrases. I tried to do this with the Phrases model in gensim using the code below, but it does not give me the output I expect.

My code

from gensim.models import Phrases 
documents = ["the mayor of new york was there", "machine learning can be useful sometimes"] 

sentence_stream = [doc.split(" ") for doc in documents] 
bigram = Phrases(sentence_stream) 
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'] 
print(bigram[sent])

Output

[u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']

But it should be

[u'the', u'mayor', u'of', u'new_york', u'was', u'there']

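However, when I print the vocabulary learned from the training data, I can see the bigrams, yet the model does not join them on the test sentence. Where am I going wrong?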

print(bigram.vocab)

defaultdict(<type 'int'>, {'useful': 1, 'was_there': 1, 'learning_can': 1, 'learning': 1, 'of_new': 1, 'can_be': 1, 'mayor': 1, 'there': 1, 'machine': 1, 'new': 1, 'was': 1, 'useful_sometimes': 1, 'be': 1, 'mayor_of': 1, 'york_was': 1, 'york': 1, 'machine_learning': 1, 'the_mayor': 1, 'new_york': 1, 'of': 1, 'sometimes': 1, 'can': 1, 'be_useful': 1, 'the': 1})

Source

2016-03-01 Prashant Puri

I found the solution to the problem. There are two parameters I had not taken care of that should be passed to the Phrases() model:

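min_count: ignore all words and bigrams whose total collected count is lower than this. Its default value is 5.
threshold: the threshold for forming phrases (higher means fewer phrases). A phrase of words a and b is accepted if (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)) > threshold, where N is the total vocabulary size. Its default value is 10.0.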
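To make the acceptance rule concrete, here is a minimal sketch of the formula quoted above as a plain Python function. It is not gensim's own code; the function name phrase_score and the counts passed to it are illustrative assumptions chosen only to show how min_count and threshold interact.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score from the quoted formula:
    (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))
    """
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Illustrative counts (assumed, not measured on a real corpus):
# each word seen twice, the pair seen twice, 28 vocabulary entries.
score = phrase_score(count_a=2, count_b=2, count_ab=2,
                     vocab_size=28, min_count=1)

# The pair is joined only when the score exceeds the threshold.
accepted_with_threshold_2 = score > 2
accepted_with_default_threshold = score > 10.0

print(score)                            # 7.0
print(accepted_with_threshold_2)        # True
print(accepted_with_default_threshold)  # False
```

With these numbers the same pair is accepted at threshold=2 but rejected at the default of 10.0, which is why a tiny corpus usually needs both parameters lowered.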

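With my two-sentence training data above, the value compared against the threshold was 0, so I changed the training data and added these two parameters.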

My new code

from gensim.models import Phrases 
documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"] 

sentence_stream = [doc.split(" ") for doc in documents] 
bigram = Phrases(sentence_stream, min_count=1, threshold=2) 
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'] 
print(bigram[sent])

Output

[u'the', u'mayor', u'of', u'new_york', u'was', u'there']

Gensim is really awesome :)

Source

2016-03-02 13:39:57

Thank you for the valuable answer. But in this example, the bigram model does not turn "machine", "learning" into "machine_learning". Do you know why that happens? – 2017-09-10 05:16:56

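If you add "machine learning" to the sentences twice before training, and then add it to the sent variable, you will get "machine_learning". If the model has not seen that pair often enough, it has no way of knowing it is a phrase. – ethanenglish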
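The effect described in this comment can be checked by hand with the formula from the answer. The sketch below does not call gensim; it counts unigrams and bigrams with a Counter and assumes that N is the number of entries in a vocabulary holding both, as in the bigram.vocab printout in the question. The helper names build_vocab and phrase_score are hypothetical.

```python
from collections import Counter


def build_vocab(sentences):
    """Count every word and every adjacent word pair (joined by '_')."""
    vocab = Counter()
    for tokens in sentences:
        vocab.update(tokens)
        vocab.update("_".join(pair) for pair in zip(tokens, tokens[1:]))
    return vocab


def phrase_score(vocab, a, b, min_count):
    """(cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)), with N = len(vocab)."""
    return (vocab[a + "_" + b] - min_count) * len(vocab) / (vocab[a] * vocab[b])


documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]
sentences = [doc.split(" ") for doc in documents]

# "machine learning" occurs once, so (1 - min_count) is 0 when min_count=1.
score_before = phrase_score(build_vocab(sentences), "machine", "learning", 1)

# Add the pair twice more, as the comment suggests, and rescore.
extended = sentences + [["machine", "learning"], ["machine", "learning"]]
score_after = phrase_score(build_vocab(extended), "machine", "learning", 1)

print(score_before)      # 0.0
print(score_after > 2)   # True
```

Under these assumptions the pair scores 0 on the original three sentences, which is below threshold=2, and clears the threshold once it has been seen more often.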
