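List as a feature in DictVectorizer (Python, Pandas and Scikit-learn)

I have been trying to solve this problem. I found a similar question here, "How can i vectorize list using sklearn DictVectorizer", but the solution there is oversimplified.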

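I want to fit some features to a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I extract two features: 1) the last name on its own, and 2) a list of the two-letter substrings of the last name; for example, 'Chan' gives ['ch', 'ha', 'an']. However, it seems DictVectorizer does not accept a list as a value in the dictionary. Following the link above, I wrote a function list_to_dict, which works and returns dictionary elements such as: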

{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True} 

However, I don't know how to incorporate that into my_dict = ... before applying DictVectorizer.

# coding=utf-8 
import pandas as pd 
from pandas import DataFrame, Series 
import numpy as np 
import nltk 
import re 
import random 
from random import randint 
import sys 
reload(sys) 
sys.setdefaultencoding('utf-8') 

from sklearn.linear_model import LogisticRegression 
from sklearn.feature_extraction import DictVectorizer 

lr = LogisticRegression() 
dv = DictVectorizer() 

# Get csv file into data frame 
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8") 
df = DataFrame(data) 

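# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
# reset_index returns a new frame, so the result has to be assigned
df_shuffled = df_shuffled.reset_index(drop=True)

# Assign X and y variables from the shuffled frame
X = df_shuffled.raw_name.values
y = df_shuffled.chineseScan.values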

# Feature extraction functions 
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1:  # do not accept names with only 1 character
            return last_name
        else:
            return None
    except:
        return None

def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except:
        return []

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except:
        return None

list_example = ['co', 'or', 'rn', 'ns'] 
print list_to_dict(list_example) 

# Transform format of X variables, and spit out a numpy array for all features 
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)), 
    'last-name': feature_full_last_name(i), 'dummy': 1} for i in X] 

print my_dict[3] 

Output:

{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True} 
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'} 

Sample data:

Raw_name         chineseScan
Jack Anderson    non-chinese
Po Lee           chinese

Answer

If I understand correctly, you want a way to encode a list of values in a feature dictionary so that DictVectorizer can use it.

my_dict_list = [] 

for i in X: 
    # create a new feature dictionary 
    feat_dict = {} 
    # add the features that are straight forward 
    feat_dict['last-name'] = feature_full_last_name(i) 
    feat_dict['dummy'] = 1 

    # for the features that have a list of values iterate over the values and 
    # create a custom feature for each value 
    for two_letters in feature_twoLetters(feature_full_last_name(i)):
        # make sure the naming is unique enough so that no other feature
        # unrelated to this will have the same name/key
        feat_dict['two-letter-substrings-' + two_letters] = True

    # save it to the feature dictionary list that will be used in Dict vectorizer 
    my_dict_list.append(feat_dict) 

print my_dict_list 

from sklearn.feature_extraction import DictVectorizer 
dict_vect = DictVectorizer(sparse=False) 
transformed_x = dict_vect.fit_transform(my_dict_list) 
print transformed_x 

Output:

[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}] 
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.] 
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]] 
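
To try the loop above end to end without the CSV file, here is a minimal Python 3 sketch (the code in this thread is Python 2). The helper names last_name, two_letter_substrings and name_features are hypothetical restatements of the asker's functions, and the two names are taken from the sample data. One assumption of mine: a name with no usable last name simply gets no 'last-name' key instead of a None value.

```python
from sklearn.feature_extraction import DictVectorizer


def last_name(raw_name):
    """Return the last whitespace-separated token, or None if it is too short."""
    parts = raw_name.split()
    if not parts or len(parts[-1]) < 2:
        return None
    return parts[-1]


def two_letter_substrings(name):
    """Return every run of two adjacent characters in `name`."""
    if not name:
        return []
    return [name[i:i + 2] for i in range(len(name) - 1)]


def name_features(raw_name):
    """Build one flat feature dict: one boolean key per two-letter substring."""
    surname = last_name(raw_name)
    features = {'dummy': 1}
    if surname is not None:  # assumption: skip the key rather than store None
        features['last-name'] = surname
    for pair in two_letter_substrings(surname):
        features['two-letter-substrings-' + pair] = True
    return features


names = ['Jack Anderson', 'Po Lee']  # the two rows from the sample data
feature_dicts = [name_features(n) for n in names]

dict_vect = DictVectorizer(sparse=False)
matrix = dict_vect.fit_transform(feature_dicts)

print(dict_vect.feature_names_)  # column order of the matrix
print(matrix)
```

With these two names the result should match the 2 x 12 matrix shown above, and feature_names_ tells you which column corresponds to which feature.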

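(A year late, but this may still be useful depending on the situation.) Another thing you could do, though I don't recommend it, if you don't want to create as many features as there are values in your list, is something like this: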

# sorting the values would be a good idea 
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True 
# or 
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True 

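But the first one means you cannot have any duplicate values, and probably neither of them makes a good feature, especially if you need fine-grained detail. Also, they reduce the chance that two rows end up with the same feature, since the rows would need exactly the same combination of two-letter substrings, so the classification will probably not work well.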
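
The trade-off can be checked in plain Python 3, without scikit-learn. This is only a sketch: the surnames 'Chan', 'Chang' and 'nana' are hypothetical inputs chosen for illustration, and the helper names are mine.

```python
def two_letter_substrings(name):
    """Return every run of two adjacent characters in `name`."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def per_substring_keys(name):
    """One feature key per substring, as in the recommended approach."""
    return {'two-letter-substrings-' + p for p in two_letter_substrings(name)}


def joined_key(name):
    """A single feature key holding all substrings, as in the join variant."""
    return ' '.join(two_letter_substrings(name))


# Two similar (hypothetical) surnames share most per-substring features...
shared = per_substring_keys('Chan') & per_substring_keys('Chang')
print(sorted(shared))

# ...but their joined keys differ, so the rows have no feature in common.
print(joined_key('Chan') == joined_key('Chang'))

# A frozenset also drops repeated substrings: 'nana' yields 'na' twice.
pairs = two_letter_substrings('nana')
print(pairs, len(frozenset(pairs)))
```

With per-substring keys the model can learn from the overlap between similar names; with a single combined key it only learns from exact matches.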

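Output (the feature dictionaries for the frozenset and join variants, followed by the transformed matrix):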

[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}] 
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}] 
[[ 1. 0. 1. 1. 0.] 
[ 0. 1. 1. 0. 1.]]
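
A note for readers on newer versions: if I read the scikit-learn documentation correctly, DictVectorizer in version 0.24 and later accepts a list (or set) of strings as a value and expands it into one 'key=value' column per string, counting occurrences. Assuming such a version is installed, the list can be passed as-is. This is a Python 3 sketch that reuses the two sample names, with the substring lists written out by hand.

```python
from sklearn.feature_extraction import DictVectorizer

# Assumption: scikit-learn >= 0.24, where list-of-strings values are accepted.
feature_dicts = [
    {'dummy': 1, 'last-name': 'Anderson',
     'two-letter-substrings': ['An', 'nd', 'de', 'er', 'rs', 'so', 'on']},
    {'dummy': 1, 'last-name': 'Lee',
     'two-letter-substrings': ['Le', 'ee']},
]

dict_vect = DictVectorizer(sparse=False)
matrix = dict_vect.fit_transform(feature_dicts)

# Each list element becomes its own column named 'two-letter-substrings=<value>'.
print(dict_vect.feature_names_)
print(matrix)
```

On older releases this raises an error for the list value, in which case the explicit loop from the first part of the answer is the way to go.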