
For example, I want to use last names to predict whether someone is Chinese or non-Chinese. In particular, I want to extract three-letter substrings from the last names: the name "gao" would give one feature, "gao", while "chan" would give two features, "cha" and "han". How do I create a feature dictionary from these for a Python machine learning (naive Bayes) algorithm?

The splitting is done successfully in the three_split function below. But as far as I understand, to use this as a feature set I need to return the output as a dictionary. Any hints on how to do that? For "chan", the dictionary should return "cha" and "han" as TRUE.
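In other words, the desired return value would look something like this (a sketch of what I want, not what the current function gives; key order may vary):

>>> three_split('chan')
{'cha': True, 'han': True}
>>> three_split('gao')
{'gao': True}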

from nltk.classify import PositiveNaiveBayesClassifier 
import re 

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee'] 

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis'] 

def three_split(word): 
    word = word.lower() 
    word = word.replace(" ", "_") 
    split = 3 
    return [word[start:start+split] for start in range(0, len(word)-2)] 

positive_featuresets = list(map(three_split, chinese_names)) 
unlabeled_featuresets = list(map(three_split, nonchinese_names)) 
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets) 

print three_split("Jim Silva") 
print classifier.classify(three_split("Jim Silva")) 

Answers


Here's a whitebox answer:

Running your original code, it outputs:

Traceback (most recent call last): 
    File "test.py", line 17, in <module> 
    unlabeled_featuresets) 
    File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train 
    for fname, fval in featureset.items(): 
AttributeError: 'list' object has no attribute 'items' 

Looking at line 17:

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets) 

It seems that PositiveNaiveBayesClassifier requires an object with an .items() method, which, intuitively, should be a dict if the NLTK code is pythonic.
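A quick check in the interpreter shows why the traceback complains (a minimal sketch; item order may vary):

>>> {'cha': True, 'han': True}.items()
[('cha', True), ('han', True)]
>>> ['cha', 'han'].items()
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'items'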

Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there is no clear explanation of what the positive_featuresets parameter should contain:

:param positive_featuresets: A list of featuresets that are known as positive examples (i.e., their label is True).

Checking the docstring, we see this example:

Example: 
    >>> from nltk.classify import PositiveNaiveBayesClassifier 
Some sentences about sports: 
    >>> sports_sentences = [ 'The team dominated the game', 
    ...      'They lost the ball', 
    ...      'The game was intense', 
    ...      'The goalkeeper catched the ball', 
    ...      'The other team controlled the ball' ] 
Mixed topics, including sports: 
    >>> various_sentences = [ 'The President did not comment', 
    ...      'I lost the keys', 
    ...      'The team won the game', 
    ...      'Sara has two kids', 
    ...      'The ball went off the court', 
    ...      'They had the ball for the whole game', 
    ...      'The show is over' ] 
The features of a sentence are simply the words it contains: 
    >>> def features(sentence): 
    ...  words = sentence.lower().split() 
    ...  return dict(('contains(%s)' % w, True) for w in words) 
We use the sports sentences as positive examples, the mixed ones as unlabeled examples: 
    >>> positive_featuresets = list(map(features, sports_sentences)) 
    >>> unlabeled_featuresets = list(map(features, various_sentences)) 
    >>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    ...             unlabeled_featuresets) 

Now we've found the features() function, which converts a sentence into features and returns

dict(('contains(%s)' % w, True) for w in words) 

Basically, that's something .items() can be called on. Looking at the dict comprehension, the 'contains(%s)' % w part seems a bit redundant unless it's there for human readability, so you could simply use dict((w, True) for w in words).
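For instance, here is a minimal sketch of that simplification on one of the docstring sentences (key order may vary):

>>> words = 'the team dominated the game'.split()
>>> dict((w, True) for w in words)
{'the': True, 'game': True, 'team': True, 'dominated': True}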

Also, replacing spaces with underscores might be redundant unless it's needed later. Finally, the slicing and bounded iteration could have been replaced with the ngrams function, which can extract character ngrams, e.g.:

>>> word = 'alexgao' 
>>> split=3 
>>> [word[start:start+split] for start in range(0, len(word)-2)] 
['ale', 'lex', 'exg', 'xga', 'gao'] 
# With ngrams 
>>> from nltk.util import ngrams 
>>> ["".join(ng) for ng in ngrams(word,3)] 
['ale', 'lex', 'exg', 'xga', 'gao'] 

So your feature extraction function could have been simplified to this:

from nltk.util import ngrams 
def three_split(word): 
    return dict(("".join(ng), True) for ng in ngrams(word.lower(), 3)) 

[out]:

{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True} 
False 

In fact, NLTK classifiers are versatile enough that you can use tuples of characters directly as features, so you don't need to patch up the ngrams output when extracting features, i.e.:

from nltk.classify import PositiveNaiveBayesClassifier 
from nltk.util import ngrams 

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee'] 

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis'] 


def three_split(word): 
    return dict((ng, True) for ng in ngrams(word.lower(), 3)) 

positive_featuresets = list(map(three_split, chinese_names)) 
unlabeled_featuresets = list(map(three_split, nonchinese_names)) 

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets) 

print three_split("Jim Silva") 
print classifier.classify(three_split("Jim Silva")) 

[out]:

{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True} 

With some trial and error, I think I've got it. Thanks.

from nltk.classify import PositiveNaiveBayesClassifier 

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee'] 

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis'] 

def three_split(word): 
    word = word.lower() 
    word = word.replace(" ", "_") 
    split = 3 
    return dict(("contains(%s)" % word[start:start+split], True) 
                for start in range(0, len(word)-2)) 

positive_featuresets = list(map(three_split, chinese_names)) 
unlabeled_featuresets = list(map(three_split, nonchinese_names)) 
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets) 

name = "dennis kidd" 
print three_split(name) 
print classifier.classify(three_split(name))
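
For reference, assuming the code above is run as-is, the feature dict printed for "dennis kidd" would contain these keys (dict order may vary, and the classify result depends on the trained model, so it is not shown here):

{'contains(den)': True, 'contains(enn)': True, 'contains(nni)': True, 'contains(nis)': True, 'contains(is_)': True, 'contains(s_k)': True, 'contains(_ki)': True, 'contains(kid)': True, 'contains(idd)': True}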