Here's a white-box answer:

Using your original code, it outputs:
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    unlabeled_featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train
    for fname, fval in featureset.items():
AttributeError: 'list' object has no attribute 'items'
Looking at line 17:

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

It seems that PositiveNaiveBayesClassifier requires objects that have an .items() attribute; intuitively, that should be a dict if the NLTK code is pythonic.
Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there is no clear explanation of what the positive_featuresets parameter should contain:

:param positive_featuresets: A list of featuresets that are known as positive examples (i.e. their label is True).
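In other words, each featureset in the list must be dict-like, since train() iterates featureset.items(); passing a list of raw strings is what triggers the traceback above. A minimal check (plain Python, no NLTK needed):

```python
# train() calls featureset.items() on every element,
# so each featureset must be a mapping, not a list of strings.
good_featureset = {'contains(team)': True}  # dict: has .items()
bad_featureset = ['team']                   # list: no .items() -> AttributeError

assert hasattr(good_featureset, 'items')
assert not hasattr(bad_featureset, 'items')
```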
Checking the docstring, we see this example:
Example:
>>> from nltk.classify import PositiveNaiveBayesClassifier
Some sentences about sports:
>>> sports_sentences = [ 'The team dominated the game',
... 'They lost the ball',
... 'The game was intense',
... 'The goalkeeper catched the ball',
... 'The other team controlled the ball' ]
Mixed topics, including sports:
>>> various_sentences = [ 'The President did not comment',
... 'I lost the keys',
... 'The team won the game',
... 'Sara has two kids',
... 'The ball went off the court',
... 'They had the ball for the whole game',
... 'The show is over' ]
The features of a sentence are simply the words it contains:
>>> def features(sentence):
... words = sentence.lower().split()
... return dict(('contains(%s)' % w, True) for w in words)
We use the sports sentences as positive examples, the mixed ones as unlabeled examples:
>>> positive_featuresets = list(map(features, sports_sentences))
>>> unlabeled_featuresets = list(map(features, various_sentences))
>>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
... unlabeled_featuresets)
Now we have found it: the features() function converts a sentence into features and returns

dict(('contains(%s)' % w, True) for w in words)

Basically, this is something that is able to call .items(). Looking at the dict comprehension, it seems that 'contains(%s)' % w is somewhat redundant unless it is meant for human readability, so you could simply have used dict((w, True) for w in words).
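A quick sketch of that simplified extractor (plain Python, no NLTK needed):

```python
def features(sentence):
    # each lowercased word becomes a feature key mapped to True
    words = sentence.lower().split()
    return dict((w, True) for w in words)

print(features("The team dominated the game"))
# {'the': True, 'team': True, 'dominated': True, 'game': True}
```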
Also, replacing spaces with underscores might be redundant unless there is a use for it later on. Finally, the slicing and limited iteration could have been replaced with an ngrams function that extracts character ngrams, e.g.:
>>> word = 'alexgao'
>>> split = 3
>>> [word[start:start+split] for start in range(len(word)-split+1)]
['ale', 'lex', 'exg', 'xga', 'gao']
# With ngrams
>>> from nltk.util import ngrams
>>> ["".join(ng) for ng in ngrams(word,3)]
['ale', 'lex', 'exg', 'xga', 'gao']
Your feature extraction function could have been simplified to this:

from nltk.util import ngrams
def three_split(word):
    return dict(("".join(ng), True) for ng in ngrams(word.lower(), 3))
[out]:
{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True}
False
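If you would rather not depend on nltk.util.ngrams for this, the same string-keyed featuresets can be produced with plain slicing (a sketch, equivalent to the function above):

```python
def three_split(word):
    # character trigrams as string keys, built with plain slicing
    w = word.lower()
    return dict((w[i:i+3], True) for i in range(len(w) - 2))

print(three_split("Jim Silva"))
```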
In fact, NLTK classifiers are so versatile that you can use tuples of characters directly as features, so there's no need to join the ngrams back into strings when extracting the features, i.e.:
from nltk.classify import PositiveNaiveBayesClassifier
import re
from nltk.util import ngrams
chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']
def three_split(word):
    return dict((ng, True) for ng in ngrams(word.lower(), 3))
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
unlabeled_featuresets)
print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
[out]:
{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True}
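For reference, here is a dependency-free sketch of that tuple-keyed variant; the keys mirror the output above (assuming nltk.util.ngrams yields sliding character tuples):

```python
def three_split(word):
    # character trigrams as tuple keys, mirroring nltk.util.ngrams(word, 3)
    w = word.lower()
    return dict((tuple(w[i:i+3]), True) for i in range(len(w) - 2))

featureset = three_split("Jim Silva")
```

Either key type works, because the Naive Bayes classifier only treats feature names as opaque hashable labels.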