Converting numeric data to feature vectors

I want to use this code to normalize numeric data into feature vectors:

import numpy as np 
import pandas as pd 
import csv 

def clearRegister(): 
    clear_register = [] 
    zero = 0 
    for i in range(21): 
     clear_register.append(0) 
    return clear_register 

def header(): 
    clear_register = [] 
    name = 'c' 
    entry = 1 
    for i in range(21): 
     clear_register.append(name+str(entry)) 
     entry += 1 
    return clear_register 

def convert(filename): 
    clear_dataset = [] 
    clear_dataset.append(header()) 
    with open(filename) as csvfile: 
     reader = csv.DictReader(csvfile) 
     for row in reader: 
      clear_register = clearRegister() 
      clear_register[(int(row["blue1"])-1)] = 1 
      clear_register[(int(row["blue2"])-1)] = 1 
      clear_register[(int(row["blue3"])-1)] = 1 
      clear_register[(int(row["red1"])+9)] = 1 
      clear_register[(int(row["red2"])+9)] = 1 
      clear_register[(int(row["red3"])+9)] = 1

Here is my CSV file input:

row blue1 blue2 blue3 red1 red2 red3 lable 
0 1 5 4 6 2 8 0 
1 2 3 1 9 4 5 1 
. . . . . . . . 
3000 5 7 4 3 8 10 1

I expect output like this (c1-c10 are blue, c11-c20 are red):

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable 
1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 
1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 
. . . . . . . . . . . . . . . . . . . . . 
0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1

c11-c20 represent 'red'. The values are unique across the two colors: if c1, c5, and c10 have the value 1, then c11, c15, and c20 cannot have it.

I tried calling it like this:

df = convert("dataset.csv") 
df1 = pd.DataFrame(df) 
print(df1)

I got this result:

Empty DataFrame 
Columns: [] 
Index: []

What is wrong with the code, or what is it missing?

Source

2017-10-05 Reza Ardiansyah

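Is there a possibility that blue1 = blue2 = blue3, and likewise for red? Do you actually need a count, or is the answer always binary? – DJK

Always binary. I forgot to mention that the values do not repeat (they are unique) across both colors, so if c1 has the value 1, then c11, which is red's counterpart of c1, will not have the same value. –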

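Consider a pandas solution instead of csv manipulation, using loc to create the new c1-c20 columns iteratively. Demonstrated below with random data:

Data (only for readers of the question; the OP would use the actual CSV instead)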

import numpy as np 
import pandas as pd 

pd.set_option('display.width', 1000) 
pd.set_option('display.max_columns', 25) 

np.random.seed(5005) 
df = pd.DataFrame({'row': range(3000), 
        'blue1': [np.random.randint(11) for _ in range(3000)], 
        'blue2': [np.random.randint(11) for _ in range(3000)], 
        'blue3': [np.random.randint(11) for _ in range(3000)], 
        'red1': [np.random.randint(11) for _ in range(3000)], 
        'red2': [np.random.randint(11) for _ in range(3000)], 
        'red3': [np.random.randint(11) for _ in range(3000)], 
        'lable': [0,1]*1500}) 

print(df.head()) 
# blue1 blue2 blue3 lable red1 red2 red3 row 
# 0  4  5  5  0 10  0  8 0 
# 1  7  2  2  1  3  8  8 1 
# 2  2  4  0  0  8  1  7 2 
# 3  4  5  8  1  9  8  1 3 
# 4  0  1  5  0  5  6  9 4

Process

for i in range(1,11):  
    df.loc[(df['blue1'] == i) | (df['blue2'] == i) | (df['blue3'] == i), 'c'+str(i)] = 1 
    df.loc[(df['red1'] == i) | (df['red2'] == i) | (df['red3'] == i), 'c'+str(i+10)] = 1 

# SELECT AND RE-ORDER COLUMNS, FILL IN NANs, CONVERT TO INT TYPE 
df = df[['c'+str(i) for i in range(1,21)]+['lable']].fillna(0).astype(int) 

print(df.head())  
# c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable 
# 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1  0 
# 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0  1 
# 2 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0  0 
# 3 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0  1 
# 4 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0  0
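
As for why the original call prints an empty DataFrame: convert, as posted, never appends clear_register to clear_dataset and has no return statement, so it returns None, and pd.DataFrame(None) is empty. Below is a minimal sketch of a corrected csv-based version. It assumes the file is comma-delimited and that every blue/red value lies in 1..10. The handling of the lable column, the N_VALUES constant, and the choice to accept an open file object (so the example can run on an in-memory sample) are my additions, not part of the question's code.

```python
import csv
import io

import pandas as pd

N_VALUES = 10  # assumption: blue and red values range over 1..10


def convert(csvfile):
    """Turn rows of blue/red values into 20 binary columns plus the label."""
    header = ['c' + str(i) for i in range(1, 2 * N_VALUES + 1)] + ['lable']
    rows = []
    for row in csv.DictReader(csvfile):
        register = [0] * (2 * N_VALUES + 1)
        for col in ('blue1', 'blue2', 'blue3'):
            register[int(row[col]) - 1] = 1              # c1..c10
        for col in ('red1', 'red2', 'red3'):
            register[int(row[col]) - 1 + N_VALUES] = 1   # c11..c20
        register[-1] = int(row['lable'])
        rows.append(register)   # the posted code never stored the row
    return pd.DataFrame(rows, columns=header)  # the posted code returned nothing


# The first two sample rows from the question, written as comma-delimited CSV.
sample = io.StringIO(
    "row,blue1,blue2,blue3,red1,red2,red3,lable\n"
    "0,1,5,4,6,2,8,0\n"
    "1,2,3,1,9,4,5,1\n"
)
df = convert(sample)
print(df)
```

With a real file, the same function can be called as `with open("dataset.csv", newline="") as f: df = convert(f)`.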

Source

2017-10-05 19:00:03 Parfait

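Although the example given is 21x3000, the real dataset I need to convert contains 277 columns and 39,500 rows, which makes execution very slow... Anyway, I really appreciate your help. Thank you very much! –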
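
If the per-value loop with repeated loc assignments is the bottleneck, one option is to fill the whole indicator matrix at once with NumPy integer indexing. The sketch below is only an illustration under assumptions: the function name encode and its parameters are hypothetical, every value is assumed to lie in 1..n_values (a 0 would silently wrap to the last column of its group), and how this maps onto the 277-column dataset depends on a layout the comment does not describe.

```python
import numpy as np
import pandas as pd


def encode(df, blue_cols, red_cols, n_values=10, label_col='lable'):
    """Build c1..c(2*n_values) indicator columns without a Python-level loop."""
    out = np.zeros((len(df), 2 * n_values), dtype=int)
    rows = np.arange(len(df))[:, None]  # shape (n, 1), broadcasts against (n, 3)
    # Assumes all values are in 1..n_values; shift to 0-based column indices.
    out[rows, df[blue_cols].to_numpy() - 1] = 1             # c1..c10
    out[rows, df[red_cols].to_numpy() - 1 + n_values] = 1   # c11..c20
    names = ['c' + str(i) for i in range(1, 2 * n_values + 1)]
    result = pd.DataFrame(out, columns=names, index=df.index)
    result[label_col] = df[label_col].to_numpy()
    return result


# The first two sample rows from the question.
df = pd.DataFrame({'blue1': [1, 2], 'blue2': [5, 3], 'blue3': [4, 1],
                   'red1': [6, 9], 'red2': [2, 4], 'red3': [8, 5],
                   'lable': [0, 1]})
encoded = encode(df, ['blue1', 'blue2', 'blue3'], ['red1', 'red2', 'red3'])
print(encoded)
```

The work here is a single array assignment per color group, so the cost grows with the number of input cells rather than with the number of distinct values.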
