2017-10-05 221 views

I want to use the following code to convert numeric data into binary feature vectors:

import numpy as np 
import pandas as pd 
import csv 

def clearRegister():
    clear_register = []
    zero = 0
    for i in range(21):
        clear_register.append(0)
    return clear_register

def header():
    clear_register = []
    name = 'c'
    entry = 1
    for i in range(21):
        clear_register.append(name+str(entry))
        entry += 1
    return clear_register

def convert(filename):
    clear_dataset = []
    clear_dataset.append(header())
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            clear_register = clearRegister()
            clear_register[(int(row["blue1"])-1)] = 1
            clear_register[(int(row["blue2"])-1)] = 1
            clear_register[(int(row["blue3"])-1)] = 1
            clear_register[(int(row["red1"])+9)] = 1
            clear_register[(int(row["red2"])+9)] = 1
            clear_register[(int(row["red3"])+9)] = 1

Here is my CSV file input:

row blue1 blue2 blue3 red1 red2 red3 lable 
0 1 5 4 6 2 8 0 
1 2 3 1 9 4 5 1 
. . . . . . . . 
3000 5 7 4 3 8 10 1 

I expect output like this (c1-c10 for blue, c11-c20 for red):

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable 
1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 
1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 
. . . . . . . . . . . . . . . . . . . . . 
0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 

c11-c20 are the 'red' counterparts of c1-c10, and a value is never shared between the two groups. If c1, c5, and c10 are 1, then c11, c15, and c20 cannot be 1.

I tried calling it like this:

df = convert("dataset.csv") 
df1 = pd.DataFrame(df) 
print(df1) 

I got this result:

Empty DataFrame 
Columns: [] 
Index: [] 

What is wrong with the code, or what is it missing?


Is there a possibility that blue1 = blue2 = blue3, and likewise for red? Is what you actually need a count, or is the answer always binary? – DJK


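Always binary. I forgot to mention that the values do not repeat (they are unique) across both groups, so if c1 is 1, then c11, the red counterpart of c1, will not have the same value. –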

Answer


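Consider a pandas solution instead of csv manipulation, using loc to create the new c1-c20 columns iteratively. It is demonstrated below with random data: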

Data (only for readers of this question; the OP would use the actual CSV instead)

import numpy as np 
import pandas as pd 

pd.set_option('display.width', 1000) 
pd.set_option('display.max_columns', 25) 

np.random.seed(5005) 
df = pd.DataFrame({'row': range(3000), 
        'blue1': [np.random.randint(11) for _ in range(3000)], 
        'blue2': [np.random.randint(11) for _ in range(3000)], 
        'blue3': [np.random.randint(11) for _ in range(3000)], 
        'red1': [np.random.randint(11) for _ in range(3000)], 
        'red2': [np.random.randint(11) for _ in range(3000)], 
        'red3': [np.random.randint(11) for _ in range(3000)], 
        'lable': [0,1]*1500}) 

print(df.head()) 
# blue1 blue2 blue3 lable red1 red2 red3 row 
# 0  4  5  5  0 10  0  8 0 
# 1  7  2  2  1  3  8  8 1 
# 2  2  4  0  0  8  1  7 2 
# 3  4  5  8  1  9  8  1 3 
# 4  0  1  5  0  5  6  9 4 

Process

for i in range(1,11):  
    df.loc[(df['blue1'] == i) | (df['blue2'] == i) | (df['blue3'] == i), 'c'+str(i)] = 1 
    df.loc[(df['red1'] == i) | (df['red2'] == i) | (df['red3'] == i), 'c'+str(i+10)] = 1 

# SELECT AND RE-ORDER COLUMNS, FILL IN NANs, CONVERT TO INT TYPE 
df = df[['c'+str(i) for i in range(1,21)]+['lable']].fillna(0).astype(int) 

print(df.head())  
# c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable 
# 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1  0 
# 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0  1 
# 2 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0  0 
# 3 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0  1 
# 4 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0  0 
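
As for why the original code prints an empty DataFrame: `convert()` fills `clear_register` for each row but never appends it to `clear_dataset`, and the function has no `return`, so it returns `None` and `pd.DataFrame(None)` is empty. Below is a minimal sketch of a csv-based fix. It assumes the file is comma-separated and that every blue and red value lies between 1 and 10; `N_VALUES` and the in-memory `sample` are introduced here for illustration only.

```python
import csv
import io

import pandas as pd

N_VALUES = 10  # assumption: blue and red values range from 1 to 10


def convert(csvfile):
    """Turn blue1..3 / red1..3 values into 20 binary columns plus the label."""
    columns = ['c' + str(i) for i in range(1, 2 * N_VALUES + 1)] + ['lable']
    records = []
    for row in csv.DictReader(csvfile):
        register = [0] * (2 * N_VALUES + 1)
        for key in ('blue1', 'blue2', 'blue3'):
            register[int(row[key]) - 1] = 1              # c1..c10
        for key in ('red1', 'red2', 'red3'):
            register[int(row[key]) - 1 + N_VALUES] = 1   # c11..c20
        register[-1] = int(row['lable'])                 # copy the label across
        records.append(register)                         # store the finished row
    return pd.DataFrame(records, columns=columns)        # hand the result back


# The first two rows from the question, written as comma-separated text.
sample = io.StringIO(
    "row,blue1,blue2,blue3,red1,red2,red3,lable\n"
    "0,1,5,4,6,2,8,0\n"
    "1,2,3,1,9,4,5,1\n"
)
df1 = convert(sample)
print(df1)
```

With a real file, the same function can be called as `with open("dataset.csv", newline="") as f: df1 = convert(f)`.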

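Although the example given is 21x3000, the real dataset I need to convert has 277 columns and 39500 rows, which makes execution very slow... Anyway, I really appreciate your help. Thank you very much! –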
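
For a larger dataset like the one described in the comment above, one option that may help is to set all the flags at once with NumPy integer-array indexing instead of building a boolean mask per value. This is only a sketch: the layout of the real 277-column file is not shown in the thread, so the helper `one_hot_groups`, its `groups` argument, and `n_values` are hypothetical names, and the code assumes every value is an integer between 1 and `n_values`.

```python
import numpy as np
import pandas as pd


def one_hot_groups(df, groups, n_values, label_col='lable'):
    """Encode each group of value columns as n_values binary columns.

    groups is a list of column-name lists, one list per colour.
    """
    n_rows = len(df)
    out = np.zeros((n_rows, n_values * len(groups)), dtype=np.int8)
    row_idx = np.arange(n_rows)[:, None]     # shape (n_rows, 1), broadcasts per row
    for g, cols in enumerate(groups):
        vals = df[cols].to_numpy()           # shape (n_rows, len(cols))
        if vals.min() < 1 or vals.max() > n_values:
            raise ValueError('values must lie between 1 and n_values')
        # Set every flagged cell of this group in a single indexing operation.
        out[row_idx, vals - 1 + g * n_values] = 1
    names = ['c' + str(i) for i in range(1, out.shape[1] + 1)]
    result = pd.DataFrame(out, columns=names, index=df.index)
    result[label_col] = df[label_col].to_numpy()
    return result


# The first two rows from the question.
df = pd.DataFrame({'blue1': [1, 2], 'blue2': [5, 3], 'blue3': [4, 1],
                   'red1': [6, 9], 'red2': [2, 4], 'red3': [8, 5],
                   'lable': [0, 1]})
encoded = one_hot_groups(df,
                         [['blue1', 'blue2', 'blue3'], ['red1', 'red2', 'red3']],
                         n_values=10)
print(encoded)
```

The range check matters because a value of 0 would otherwise index column -1 and silently set the wrong flag.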