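DictVectorizer with a list as one feature in Python, Pandas and Scikit-learn

I have been trying to solve this problem. I found a similar question here, How can i vectorize list using sklearn DictVectorizer, but the solution there is oversimplified.
I want to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name, from which I extract two features: 1) just the last name, and 2) a list of two-letter substrings of the last name; for example, 'Chan' gives ['ch', 'ha', 'an']. But it seems DictVectorizer does not accept a list as a value in the dictionary. Following the link above, I wrote a function list_to_dict, which successfully returns a dict of elements,
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
, but I have no idea how to incorporate that into my_dict = ... before applying the DictVectorizer.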
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled = df_shuffled.reset_index(drop=True)  # reset_index returns a new frame
# Assign X and y variables (taken from the shuffled frame)
X = df_shuffled.raw_name.values
y = df_shuffled.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1: # not accept name with only 1 character
            return last_name
        else: return None
    except: return None

def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except: return []

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return None
list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
Output:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
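The second printed dict still carries the substrings as a list under a single key, which is the part DictVectorizer does not seem to accept. A minimal sketch of one way to flatten it: build one dict per name and merge the `substring=...` entries into it with `dict.update`. The helpers `last_name`, `two_letter_substrings` and `name_to_features` are hypothetical names introduced for illustration (they are not part of the script above), and the code is written in Python 3 syntax.

```python
def last_name(raw_name):
    """Return the last whitespace-separated token, or None if it has fewer than 2 characters."""
    parts = raw_name.split()
    if not parts or len(parts[-1]) < 2:
        return None
    return parts[-1]


def two_letter_substrings(text):
    """Return every consecutive two-character slice of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def list_to_dict(substrings):
    """Map each substring to a boolean flag keyed as 'substring=<value>'."""
    return {"substring=" + s: True for s in substrings}


def name_to_features(raw_name):
    """Build one flat feature dict per name (hypothetical helper)."""
    features = {"dummy": 1}
    last = last_name(raw_name)
    if last is not None:
        features["last-name"] = last
        # Merge the substring flags into the same dict instead of nesting a list.
        features.update(list_to_dict(two_letter_substrings(last)))
    return features


print(name_to_features("corns"))
```

For 'corns' this yields the dummy and last-name entries plus one `substring=...: True` entry per two-letter substring, so no value in the dict is a list any more.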
Sample data:
Raw_name         chineseScan
Jack Anderson    non-chinese
Po Lee           chinese
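One possible way to use rows like the two above: turn each name into a flat dict (a string value for the last name, `True` for every two-letter substring) and pass the list of dicts to DictVectorizer. This is only a rough sketch, assuming scikit-learn is installed and Python 3 syntax; `to_features` is a hypothetical helper, and two rows are far too few to train a meaningful model.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def to_features(raw_name):
    """Flat feature dict: last name plus one boolean flag per two-letter substring."""
    last = raw_name.split()[-1]
    features = {"last-name": last}
    features.update(
        {"substring=" + last[i:i + 2]: True for i in range(len(last) - 1)}
    )
    return features


# The two sample rows shown above.
raw_names = ["Jack Anderson", "Po Lee"]
y = ["non-chinese", "chinese"]

dv = DictVectorizer(sparse=False)
X_vec = dv.fit_transform([to_features(name) for name in raw_names])

# Column names learned by the vectorizer, e.g. 'last-name=Lee', 'substring=ee'.
print(sorted(dv.vocabulary_))
print(X_vec.shape)

lr = LogisticRegression()
lr.fit(X_vec, y)
```

String values are one-hot encoded under names such as `last-name=Lee`, while the boolean substring flags become numeric columns, so every substring ends up as its own feature.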