2016-04-27 74 views

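My current dataset contains about 28,000 observations and 35 features. My X matrix contains the first 34 features, and my y vector contains the last (35th) feature, which I have labeled HighLowMobility in the code below. I have built a neural network to classify High vs. Low, but because of missing data points the accuracy of my algorithm is only 12%. The problem is that some of my features are missing a large number of data points. One way I got around this was to impute the missing values with the mean. That raised the accuracy to 56%, but I don't like the idea of using the mean for missing values. I am looking for another way to handle the missing values in the dataset.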

import pandas as pd

# Load the data into a DataFrame
X = pd.read_csv('raw_data_for_edits.csv')
# Impute the missing values with the column means (numeric columns only)
X = X.fillna(X.mean(numeric_only=True))
# Drop the categorical columns
X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)
# Collect the output in the y variable
y = X['HighLowMobility']

I can't copy and paste my entire dataset because it is too large, but here are the first 12 rows and 15 features:

birthcohort countyfipscode county_name cty_pop2000 statename state_id stateabbrv perm_res_p25_kr24 perm_res_p75_kr24 perm_res_p25_c1823 perm_res_p75_c1823 perm_res_p25_c19 perm_res_p75_c19 perm_res_p25_kr26 perm_res_p75_kr26 
1980 1001 Autauga 43671 Alabama 1 AL 45.29939 60.7061 NaN NaN 20.79255 66.0626 40.33072 61.38815
1981 1001 Autauga 43671 Alabama 1 AL 42.61835 63.21074 29.72325 75.26598 18.54342 54.94438 39.72811 65.40214 
1982 1001 Autauga 43671 Alabama 1 AL 48.26985 62.34378 38.06422 72.25443 21.53552 59.08011 44.65976 63.69386 
1983 1001 Autauga 43671 Alabama 1 AL 42.63371 56.42043 38.25876 80.4664 15.57722 57.13945 40.6005 61.02879 
1984 1001 Autauga 43671 Alabama 1 AL 44.01634 62.27992 38.12383 73.74701 23.0881 55.17943 43.34503 62.40761 
1985 1001 Autauga 43671 Alabama 1 AL 45.71784 61.31874 40.93386 83.06611 25.66557 72.2912 42.42057 62.00612 
1986 1001 Autauga 43671 Alabama 1 AL 47.92037 59.65535 47.48409 72.49103 28.89066 63.85233 42.06915 59.60703 
1987 1001 Autauga 43671 Alabama 1 AL 48.31079 54.04203 53.19901 84.53795 35.28359 71.83407 NaN NaN
1988 1001 Autauga 43671 Alabama 1 AL 47.98552 59.42001 52.89273 85.28442 30.55523 67.43595 NaN NaN
1980 1003 Baldwin 140415 Alabama 1 AL 42.46106 51.41415 NaN NaN 19.86316 58.6601 41.89684 55.88935
1981 1003 Baldwin 140415 Alabama 1 AL 43.00288 55.10138 35.59233 76.98567 11.48056 40.79744 42.46521 57.31494 

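Notice how the feature "perm_res_p25_c1823" is missing values. This becomes a problem for the accuracy of my algorithm. So what should I do about the missing values? I have read a little about interpolation; is that what I should use? If so, how would I code it?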

Answers


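One way to do this is with a preprocessing step, and I suggest scikit-learn for it. Based on your case, my example uses a simple "mean" strategy to replace the missing data ("NaN"), like this: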

In [1]: import pandas as pd

# SimpleImputer replaces the old sklearn.preprocessing.Imputer,
# which was removed in scikit-learn 0.22
In [2]: from sklearn.impute import SimpleImputer

# df is a copy of your sample data

In [6]: values = df[['perm_res_p25_kr26', 'perm_res_p75_kr26']].values

In [7]: values
Out[7]:
array([[     nan,      nan],
       [39.72811, 65.40214],
       [44.65976, 63.69386],
       [40.6005 , 61.02879],
       [43.34503, 62.40761],
       [42.42057, 62.00612],
       [42.06915, 59.60703],
       [     nan,      nan],
       [     nan,      nan],
       [     nan,      nan],
       [42.46521, 57.31494]])

# use a simple "mean" strategy to preprocess your missing data
# (missing_values defaults to np.nan; imputation is always column-wise)
In [8]: imp = SimpleImputer(strategy="mean")
# simple fit & transform operations
In [9]: imputed = imp.fit_transform(values)
# assign the imputed values back to the dataframe (.ix was removed in pandas 1.0)
In [10]: df.loc[:, 'perm_res_p25_kr26':'perm_res_p75_kr26'] = imputed
# and your missing data is taken care of 
In [12]: df[['perm_res_p25_kr26', 'perm_res_p75_kr26']] 
Out[12]: 
    perm_res_p25_kr26  perm_res_p75_kr26
0           42.184047          61.637213
1           39.728110          65.402140
2           44.659760          63.693860
3           40.600500          61.028790
4           43.345030          62.407610
5           42.420570          62.006120
6           42.069150          59.607030
7           42.184047          61.637213
8           42.184047          61.637213
9           42.184047          61.637213
10          42.465210          57.314940

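This is just a simple "mean" strategy (not what you want), but you can learn more about this from Preprocessing data - custom-transformers and implement your own strategy for recovering your missing data.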
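Since the question asks about interpolation specifically, here is a minimal sketch of one possible custom strategy: interpolating each feature along birthcohort within each county, using pandas alone. The column names come from the question, but the small DataFrame below is a hypothetical stand-in for the real file, and whether interpolating across cohorts is reasonable for this data is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: two counties, one feature.
df = pd.DataFrame({
    "countyfipscode": [1001, 1001, 1001, 1001, 1003, 1003, 1003],
    "birthcohort": [1980, 1981, 1982, 1983, 1980, 1981, 1982],
    "perm_res_p25_c1823": [np.nan, 29.72325, np.nan, 38.25876,
                           np.nan, 35.59233, np.nan],
})

feature_cols = ["perm_res_p25_c1823"]

# Sort so that interpolation runs along birth cohorts within each county.
df = df.sort_values(["countyfipscode", "birthcohort"])


def fill_within_county(series):
    # Linear interpolation between known cohorts; limit_direction="both"
    # also fills leading and trailing gaps with the nearest known value.
    return series.interpolate(method="linear", limit_direction="both")


df[feature_cols] = (
    df.groupby("countyfipscode")[feature_cols].transform(fill_within_county)
)

# A county with no observed value at all would still be NaN here,
# so fall back to the overall column mean for those rows.
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())

print(df)
```

Grouping by county keeps values from one county from leaking into another, and the final fillna is only a fallback for counties where a feature was never observed.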

Hope this helps.