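Word sense disambiguation in NLTK Python
I am new to NLTK Python and I am looking for a sample application that can do word sense disambiguation. My searches turn up plenty of algorithms, but no sample application. I just want to pass in a sentence and find out the sense of each word by referring to the WordNet library. Thanks.
I have found a similar module in Perl: http://marimba.d.umn.edu/allwords/allwords.html. Is there such a module in NLTK Python?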
Yes, this is possible with the wordnet module in NLTK. The similarity measures used in the tool mentioned in your post also exist in the NLTK wordnet module.
This link is dead. Can you provide a working one? – Hooked 2015-01-09 04:46:55
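NLTK has an API for accessing WordNet. WordNet groups words into synsets (sets of synonyms). Looking a word up gives you information such as its hypernyms, hyponyms, and root word.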
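The lookups described above can be sketched roughly as follows. This is only an illustration, not code from the answer: describe_synset() and describe_word() are hypothetical helper names, and the sketch assumes NLTK 3 (where name(), definition(), lemma_names(), hypernyms() and hyponyms() are Synset methods) with the WordNet corpus already downloaded.

```python
def describe_synset(synset):
    """Collect the basic WordNet facts for one synset into a plain dict."""
    return {
        "name": synset.name(),                # e.g. 'bank.n.01'
        "definition": synset.definition(),    # the gloss text
        "lemmas": list(synset.lemma_names()), # synonyms in this synset
        "hypernyms": [h.name() for h in synset.hypernyms()],  # more general senses
        "hyponyms": [h.name() for h in synset.hyponyms()],    # more specific senses
    }


def describe_word(word):
    """Describe every WordNet sense of `word` (hypothetical helper)."""
    # Imported here so the module loads even if NLTK is not installed;
    # requires the corpus, e.g. via nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    return [describe_synset(s) for s in wn.synsets(word)]
```

Calling describe_word('bank') should return one dict per sense of the word. Note that this only lists the candidate senses; it does not pick one for a given sentence.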
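"Python Text Processing with NLTK 2.0 Cookbook" is a good book to get you started with the various features of NLTK. It is easy to read, understand, and implement. You can also look at other papers (outside the realm of NLTK) that discuss using Wikipedia for word sense disambiguation.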
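Yes, in fact there is a book written by the NLTK team, which has several chapters on classification and explicitly covers how to use WordNet. You can also buy a physical copy of the book from Safari.
FYI: NLTK was written by natural language processing academics for use in their introductory programming courses.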
As far as I understand, that chapter is devoted to classification, but it does not really cover word sense disambiguation. – geekazoid 2014-01-03 03:40:57
As a practical answer to the OP's request, here is a Python implementation of several WSD methods that returns senses in the form of NLTK synset(s): https://github.com/alvations/pywsd
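It includes Lesk variants such as simple_lesk() and adapted_lesk(), the similarity-based max_similarity(), and baselines (random sense, first sense, most frequent sense).
It can be used like this: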
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 3 / NLTK 3 version; assumes pywsd is installed as a package.
from pywsd.lesk import simple_lesk
from pywsd.baseline import random_sense, first_sense
from pywsd.baseline import max_lemma_count as most_frequent_sense

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']
plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

print("======== TESTING simple_lesk ===========\n")
print("#TESTING simple_lesk() ...")
print("Context:", bank_sents[0])
answer = simple_lesk(bank_sents[0], 'bank')
print("Sense:", answer)
print("Definition:", answer.definition())  # definition() is a method in NLTK 3
print()

print("#TESTING simple_lesk() with POS ...")
print("Context:", bank_sents[1])
answer = simple_lesk(bank_sents[1], 'bank', pos='n')  # restrict to noun senses
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS and stems ...")
print("Context:", plant_sents[0])
answer = simple_lesk(plant_sents[0], 'plant', pos='n', stem=True)  # also stem the words
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("======== TESTING baseline ===========\n")
# The baselines ignore the context and look only at the word itself.
print("#TESTING random_sense() ...")
print("Context:", bank_sents[0])
answer = random_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING first_sense() ...")
print("Context:", bank_sents[0])
answer = first_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING most_frequent_sense() ...")
print("Context:", bank_sents[0])
answer = most_frequent_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()
[OUT]:
======== TESTING simple_lesk ===========
#TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
#TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
#TESTING simple_lesk() with POS and stems ...
Context: The workers at the industrial plant were overworked
Sense: Synset('plant.n.01')
Definition: buildings for carrying on industrial labor
======== TESTING baseline ===========
#TESTING random_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('deposit.v.02')
Definition: put into a bank account
#TESTING first_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
#TESTING most_frequent_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
Recently, part of the pywsd code has been ported into the latest version of NLTK, in the nltk.wsd module. Try:
>>> from nltk.wsd import lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> lesk(sent, ambiguous)
Synset('bank.v.04')
>>> lesk(sent, ambiguous).definition()
u'act as the banker in a game or in gambling'
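Since the question asks for the sense of every word in a sentence, here is a rough sketch of how the call above could be applied word by word. It is an illustration only: disambiguate_sentence() and nltk_lesk() are hypothetical helper names, the whitespace tokenization is a simplifying assumption, and as far as I know current NLTK releases document the context argument of lesk as a list of words rather than a raw string, so the sketch passes tokens.

```python
def nltk_lesk(tokens, word):
    """Disambiguate `word` within `tokens` using NLTK's Lesk implementation."""
    # Imported here so the sketch loads even without NLTK installed;
    # needs the WordNet corpus, e.g. via nltk.download('wordnet').
    from nltk.wsd import lesk

    return lesk(tokens, word)


def disambiguate_sentence(sentence, wsd=nltk_lesk):
    """Return (word, sense) pairs for every word in `sentence`.

    `wsd` is any callable taking (tokens, word); it defaults to NLTK's Lesk,
    which returns a Synset, or None when WordNet has no entry for the word.
    """
    tokens = sentence.split()  # naive whitespace tokenization (assumption)
    results = []
    for token in tokens:
        word = token.strip('.,;:!?"\'()').lower()  # drop surrounding punctuation
        if word:
            results.append((word, wsd(tokens, word)))
    return results
```

With NLTK and WordNet available, disambiguate_sentence('I went to the bank to deposit my money') yields one pair per word; call .definition() on each non-None sense to read its gloss. Function words such as 'to' or 'the' will often come back as None or with an odd sense, so in practice you would probably filter by part of speech first.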
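For better WSD performance, use the pywsd library instead of the NLTK module. In general, simple_lesk() from pywsd does better than lesk from NLTK. I will try to keep the NLTK module updated when I am free.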
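In response to Chris Spencer's comment, please note the limitations of the Lesk algorithms. I am simply providing an accurate implementation of the algorithms. It is not a silver bullet: http://en.wikipedia.org/wiki/Lesk_algorithm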
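To make that limitation concrete, here is a minimal plain-Python sketch of the simplified Lesk idea: pick the sense whose gloss shares the most words with the context. It is not the pywsd or NLTK implementation. The sense labels and the stopword list are made up for the example, and the two glosses are the WordNet definitions quoted in the output earlier in this answer.

```python
import re

# Hypothetical, deliberately tiny stopword list for this example.
STOPWORDS = frozenset({"a", "an", "the", "to", "of", "in", "into",
                       "and", "that", "i", "my", "was"})

# Toy sense inventory: the labels are invented; the glosses are the WordNet
# definitions shown in the sample output earlier in this answer.
GLOSSES = {
    "bank (financial)": "a financial institution that accepts deposits "
                        "and channels the money into lending activities",
    "bank (river)": "sloping land (especially the slope beside a body of water)",
}


def content_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def simplified_lesk(context, glosses):
    """Return (sense, overlap) for the gloss sharing most words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


print(simplified_lesk("I went to the bank to deposit my money", GLOSSES))
# ('bank (financial)', 1)   -- only "money" overlaps
print(simplified_lesk("The river bank was full of dead fishes", GLOSSES))
# ('bank (financial)', 0)   -- no overlap at all, so the first sense wins
```

The second call shows the weakness: the river gloss never mentions "river", so nothing overlaps and the heuristic silently falls back to whichever sense comes first. The variants in pywsd try to reduce this by enlarging the word sets being compared, but the underlying brittleness remains.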
Also note that, although:
lesk("My cat likes to eat mice.", "cat", "n")
does not give you the right answer, you can use pywsd's implementation of max_similarity():
>>> from pywsd.similarity import max_similarity
>>> max_similarity('my cat likes to eat mice', 'cat', 'wup', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
>>> max_similarity('my cat likes to eat mice', 'cat', 'lin', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
@Chris, if you want a python setup.py, just make a polite request and I will write it...
Unfortunately, the accuracy is pretty bad. 'lesk("My cat likes to eat mice.", "cat", "n")' => 'Synset('computerized_tomography.n.01')'. And pywsd does not even have an install script... – Cerin 2014-08-23 02:47:18
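Dear Chris, have you tried the other variants of lesk, esp. 'simple_lesk()' or 'adapted_lesk'? The original version is known to have problems, which is why other solutions can be found in the package: http://en.wikipedia.org/wiki/Lesk_algorithm. Also, I maintain it in my free time; it is not what I do for a living... – alvas 2014-08-23 16:52:52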
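Yes, I tried every variant of Lesk in your package, and none of them worked on my sample corpus. I had to create a variant that also used the glosses of all hyponyms and meronyms linked to the word to get a few positive results, and even then its accuracy was only 15%. This is not a problem with your code; it is a problem with Lesk. It is simply not a reliable heuristic. – Cerin 2014-08-24 23:32:58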
Here is a Python implementation: https://github.com/alvations/pywsd – alvas 2014-02-28 09:56:34