2010-09-13 140 views
21

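Word sense disambiguation in NLTK Python

I am new to NLTK Python and I am looking for a sample application that can do word sense disambiguation. My search results turned up plenty of algorithms but no sample application. I just want to pass in a sentence and find out the sense of each word by referring to the WordNet library. Thanks.

I found a similar module in Perl: http://marimba.d.umn.edu/allwords/allwords.html Is there such a module in NLTK Python?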

+1

Here is a Python implementation: https://github.com/alvations/pywsd – alvas 2014-02-28 09:56:34

Answers

-1

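Yes, this is possible with the wordnet module in NLTK. The similarity measures used in the tool mentioned in your post are also available in the NLTK wordnet module.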

0

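NLTK has an API for accessing WordNet. WordNet groups words into sets of synonyms (synsets). This gives you information about a word: its hypernyms, hyponyms, root word and so on.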
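As a rough illustration of that API, here is a minimal sketch that lists the synsets of a word together with their hypernyms and hyponyms. It assumes NLTK 3 is installed and that the WordNet corpus has been fetched (for example with nltk.download('wordnet')); the helper name describe_senses is made up for this example and is not part of NLTK.

```python
def describe_senses(word, pos=None):
    """Return (name, definition, hypernym names, hyponym names) per synset.

    Hypothetical helper for illustration; only the wn.* and synset.* calls
    belong to NLTK itself.
    """
    # Imported inside the function so that a missing NLTK install or missing
    # WordNet data only raises when the function is actually called.
    from nltk.corpus import wordnet as wn

    rows = []
    for synset in wn.synsets(word, pos=pos):
        rows.append((
            synset.name(),                            # e.g. 'bank.n.01'
            synset.definition(),                      # the gloss text
            [h.name() for h in synset.hypernyms()],   # more general concepts
            [h.name() for h in synset.hyponyms()],    # more specific concepts
        ))
    return rows
```

Calling describe_senses("bank", pos="n") returns one tuple per noun sense of "bank". Note that this only enumerates the candidate senses; choosing among them for a given sentence is the actual disambiguation step.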

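"Python Text Processing with NLTK 2.0 Cookbook" is a good book for getting started with the various features of NLTK. It is easy to read, understand and implement. You can also look at other papers (outside the NLTK world) that discuss using Wikipedia for word sense disambiguation.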

7

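Yes, in fact there is a book written by the NLTK team, which has several chapters on classification and explicitly covers how to use WordNet. You can also buy a physical copy of the book from Safari.

FYI: NLTK was written by natural language processing academics for use in their introductory programming courses.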

+4

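As far as I understand, that chapter is devoted to classification, which does not really correspond to word sense disambiguation. – geekazoid 2014-01-03 03:40:57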

3

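As a practical answer to the OP's request, here are Python implementations of several WSD methods that return senses in the form of NLTK synset(s): https://github.com/alvations/pywsd

It includes

  • Lesk algorithms (original Lesk, adapted Lesk and simple Lesk)
  • Baseline algorithms (random sense, first sense, most frequent sense)

It can be used as follows: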

#!/usr/bin/env python
# -*- coding: utf-8 -*-

bank_sents = ['I went to the bank to deposit my money', 
'The river bank was full of dead fishes'] 

plant_sents = ['The workers at the industrial plant were overworked', 
'The plant was no longer bearing flowers'] 

print "======== TESTING simple_lesk ===========\n" 
from lesk import simple_lesk 
print "#TESTING simple_lesk() ..." 
print "Context:", bank_sents[0] 
answer = simple_lesk(bank_sents[0],'bank') 
print "Sense:", answer 
print "Definition:",answer.definition 
print 

print "#TESTING simple_lesk() with POS ..." 
print "Context:", bank_sents[1] 
answer = simple_lesk(bank_sents[1],'bank','n') 
print "Sense:", answer 
print "Definition:",answer.definition 
print 

print "#TESTING simple_lesk() with POS and stems ..." 
print "Context:", plant_sents[0] 
answer = simple_lesk(plant_sents[0],'plant','n', True) 
print "Sense:", answer 
print "Definition:",answer.definition 
print 

print "======== TESTING baseline ===========\n" 
from baseline import random_sense, first_sense 
from baseline import max_lemma_count as most_frequent_sense 

print "#TESTING random_sense() ..." 
print "Context:", bank_sents[0] 
answer = random_sense('bank') 
print "Sense:", answer 
print "Definition:",answer.definition 
print 

print "#TESTING first_sense() ..." 
print "Context:", bank_sents[0] 
answer = first_sense('bank') 
print "Sense:", answer 
print "Definition:",answer.definition 
print 

print "#TESTING most_frequent_sense() ..." 
print "Context:", bank_sents[0] 
answer = most_frequent_sense('bank') 
print "Sense:", answer 
print "Definition:",answer.definition 
print 

[OUT]:

======== TESTING simple_lesk =========== 

#TESTING simple_lesk() ... 
Context: I went to the bank to deposit my money 
Sense: Synset('depository_financial_institution.n.01') 
Definition: a financial institution that accepts deposits and channels the money into lending activities 

#TESTING simple_lesk() with POS ... 
Context: The river bank was full of dead fishes 
Sense: Synset('bank.n.01') 
Definition: sloping land (especially the slope beside a body of water) 

#TESTING simple_lesk() with POS and stems ... 
Context: The workers at the industrial plant were overworked 
Sense: Synset('plant.n.01') 
Definition: buildings for carrying on industrial labor 

======== TESTING baseline =========== 
#TESTING random_sense() ... 
Context: I went to the bank to deposit my money 
Sense: Synset('deposit.v.02') 
Definition: put into a bank account 

#TESTING first_sense() ... 
Context: I went to the bank to deposit my money 
Sense: Synset('bank.n.01') 
Definition: sloping land (especially the slope beside a body of water) 

#TESTING most_frequent_sense() ... 
Context: I went to the bank to deposit my money 
Sense: Synset('bank.n.01') 
Definition: sloping land (especially the slope beside a body of water) 
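
To make the idea behind the Lesk variants and the first-sense baseline more concrete, here is a toy sketch in plain Python. It is not pywsd's code: the two-entry sense inventory, the glosses, the stopword list and the function names toy_lesk and first_sense are all invented for this example, and no stemming or lemmatization is done.

```python
# Toy sense inventory with hypothetical glosses (NOT taken from WordNet).
TOY_SENSES = {
    "bank": [
        ("bank.river", "sloping land beside a body of water such as a river"),
        ("bank.finance", "a financial institution that accepts deposits of money"),
    ],
}

# Small hand-picked stopword list, just enough for the example sentences.
STOPWORDS = {"a", "an", "the", "of", "to", "i", "my", "was",
             "that", "such", "as", "and", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def toy_lesk(context_sentence, ambiguous_word, senses=TOY_SENSES):
    """Return the sense id whose gloss shares the most words with the context."""
    context = tokenize(context_sentence)
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses.get(ambiguous_word, []):
        overlap = len(context & tokenize(gloss))
        if overlap > best_overlap:  # on a tie the earlier sense is kept
            best_sense, best_overlap = sense_id, overlap
    return best_sense


def first_sense(ambiguous_word, senses=TOY_SENSES):
    """Baseline: return the first listed sense and ignore the context."""
    candidates = senses.get(ambiguous_word, [])
    return candidates[0][0] if candidates else None


print(toy_lesk("I went to the bank to deposit my money", "bank"))  # bank.finance
print(toy_lesk("The river bank was full of dead fishes", "bank"))  # bank.river
print(first_sense("bank"))                                         # bank.river
```

In the first sentence the only overlapping word is "money": "deposit" does not match "deposits" in the gloss because nothing is stemmed. That kind of near miss is what a stemming option is meant to address.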
11

Recently, part of the pywsd code has been ported into the nltk.wsd module in the latest version of NLTK. Try:

>>> from nltk.wsd import lesk 
>>> sent = 'I went to the bank to deposit my money' 
>>> ambiguous = 'bank' 
>>> lesk(sent, ambiguous) 
Synset('bank.v.04') 
>>> lesk(sent, ambiguous).definition() 
u'act as the banker in a game or in gambling' 

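For better WSD performance, use the pywsd library instead of the NLTK module. In general, simple_lesk() from pywsd does better than lesk from NLTK. I will try to update the NLTK module when I have free time.


In response to Chris Spencer's comment, please note the limitations of Lesk algorithms. I am simply providing an accurate implementation of the algorithms. It is not a silver bullet: http://en.wikipedia.org/wiki/Lesk_algorithm

Also note that although: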

lesk("My cat likes to eat mice.", "cat", "n") 

does not give you the right answer, you can use the pywsd implementation of max_similarity():

>>> from pywsd.similarity import max_similarity
>>> max_similarity('my cat likes to eat mice', 'cat', 'wup', pos='n').definition 
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats' 
>>> max_similarity('my cat likes to eat mice', 'cat', 'lin', pos='n').definition 
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats' 
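
The general idea of similarity-based disambiguation can be sketched directly on top of NLTK's WordNet interface. The following is a rough approximation of the approach and not pywsd's actual implementation; the function name max_wup_sense is hypothetical, and it assumes NLTK 3 with the WordNet corpus installed. Each candidate sense is scored by summing, over the other words in the sentence, its best Wu-Palmer similarity to any sense of that word.

```python
def max_wup_sense(context_sentence, ambiguous_word, pos="n"):
    """Pick the synset of `ambiguous_word` most similar to the context words.

    Hypothetical helper: a sketch of the max-similarity idea, not pywsd code.
    """
    # Imported lazily so that loading this snippet does not require NLTK.
    from nltk.corpus import wordnet as wn

    context_words = [w.strip(".,;:!?").lower() for w in context_sentence.split()]
    best_synset, best_score = None, -1.0
    for candidate in wn.synsets(ambiguous_word, pos=pos):
        score = 0.0
        for word in context_words:
            if word == ambiguous_word:
                continue  # do not compare the target word with itself
            # wup_similarity may return None when no path exists; treat as 0.
            sims = [candidate.wup_similarity(other) or 0.0
                    for other in wn.synsets(word, pos=pos)]
            score += max(sims, default=0.0)
        if score > best_score:
            best_synset, best_score = candidate, score
    return best_synset
```

A call such as max_wup_sense('my cat likes to eat mice', 'cat') returns a Synset (or None if the word has no synsets for that part of speech). Which sense it picks depends on the WordNet version and on how the context is tokenized, so treat the result as something to inspect rather than a guaranteed answer.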

@Chris, if you want a python setup.py, just ask politely and I will write it...

+1

Unfortunately, the accuracy is really bad. 'lesk("My cat likes to eat mice.", "cat", "n")' => 'Synset('computerized_tomography.n.01')'. And pywsd doesn't even have an install script... – Cerin 2014-08-23 02:47:18

+1

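Dear Chris, have you tried the other variants of lesk, esp. 'simple_lesk()' or 'adapted_lesk'? The original version is known to have problems, which is why the package offers other solutions. http://en.wikipedia.org/wiki/Lesk_algorithm. Also, I maintain this in my free time; it is not what I do for a living... – alvas 2014-08-23 16:52:52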

+1

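Yes, I tried every variant of Lesk in the package, and none of them worked on my sample corpus. I had to create a variant that also used the glosses of all the hyponyms and meronyms related to the word to get even a few positive results, and even then its accuracy was only 15%. It's not your code, it's a problem with Lesk. It is simply not a reliable heuristic. – Cerin 2014-08-24 23:32:58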