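How do I manipulate my data so that a random forest can run on it?

I want to train a random forest on a bunch of matrices (the first link below, for example). I want to classify them as "g" or "b" (good or bad, a or b, 1 or 0, it doesn't matter). How do I manipulate my data so that a random forest can run on it?

I have called the script randfore.py. I am currently using 10 examples, but once this is actually up and running I will use a much larger dataset.

Here is the code: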

# -*- coding: utf-8 -*- 
import numpy as np 
import pandas as pd 
import os 

import sklearn 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 

working_dir = os.getcwd() # Grabs the working directory 

directory = working_dir+"/fakesourcestuff/" ## The actual directory where the files are located 

sources = list() # Just sets up a list here which is going to become the input for the random forest 

for i in range(10): 
    cutoutfile = pd.read_csv(directory+ "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) +".dat", dtype=object) ## Where we get the input data for the random forest from 
    sources.append(cutoutfile) # add it to our sources list 

targets = pd.read_csv(directory + "faketargets.dat",sep='\n',header=None, dtype=object) # Reads in our target data... either "g" or "b" (Good or bad) 


sources = pd.DataFrame(sources) ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary? 

# Training sets 
X_train = sources[:8] # Inputs 
y_train = targets[:8] # Targets 

# Random Forest 
rf = RandomForestClassifier(n_estimators=10) 
rf_fit = rf.fit(X_train, y_train)

Here is the current error output:

Traceback (most recent call last): 
    File "randfore.py", line 31, in <module> 
    rf_fit = rf.fit(X_train, y_train) 
    File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit 
    X = check_array(X, accept_sparse="csc", dtype=DTYPE) 
    File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array 
    array = np.array(array, dtype=dtype, order=order, copy=copy) 
ValueError: setting an array element with a sequence.

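I tried setting dtype=object, but it didn't help. I'm just not sure what kind of manipulation I need to perform to make this work.

I think the problem is that what I am appending to sources is not just numbers but a mixture of numbers, commas, and various square brackets (it is basically one big matrix). Is there a natural way to import this? The square brackets in particular may be a problem.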
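One way to import such a file can be sketched as follows: strip the brackets and commas from each line and convert what remains to floats. The helper name `parse_matrix` and the sample text are hypothetical; the sample only assumes a layout matching the description above (numbers, commas, and square brackets), and the real cutout files may be laid out differently.

```python
import io
import numpy as np

def parse_matrix(text_file):
    """Parse lines of numbers that may contain brackets and commas into a 2D float array."""
    rows = []
    for line in text_file:
        # Replace brackets and commas with spaces so only the numbers remain
        for ch in "[],":
            line = line.replace(ch, " ")
        tokens = line.split()
        if tokens:  # skip blank lines
            rows.append([float(tok) for tok in tokens])
    return np.array(rows)

# Hypothetical stand-in for one cutout file; the real layout may differ
example = io.StringIO(u"[[1.0, 2.0, 3.0],\n [4.0, 5.0, 6.0]]\n")
matrix = parse_matrix(example)
print(matrix.shape)
```

For a real file, an open file handle (for example from `open(path)`) could be passed in place of the `io.StringIO` object. This sketch assumes every line holds the same number of values.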

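Before I converted sources to a DataFrame, I received the following error:

ValueError: cannot copy sequence with size 99 to array axis with dimension 1

This is due to the dimensions of my input (100 lines long) and my target, which has 10 rows and 1 column.

Here are the contents of the first cutout file that is read in (they are all in exactly the same style), used as input: https://pastebin.com/tkysqmVu

Here are the contents of faketargets.dat, the targets: https://pastebin.com/632RBqWc

Any ideas? Thanks very much. I'm sure there is a lot of fundamental confusion here.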

Source

2017-06-15 Edmond Dantès

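According to the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), the input is expected to be 2D, but you are passing a list of 2D objects, so it is 3D. You need to flatten your 2D arrays (if that makes sense) or look into feature generation. – ncfirth

@ncfirth Ah, thanks. Is there a simple way to convert this list (or the DataFrame it becomes) into a 1D array? Or into a 2D array that I can flatten (.flatten, I think)? –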
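The flattening suggested in the comment can be sketched as below: each 2D matrix becomes one row of a 2D array with shape (n_samples, n_features), which is the layout `RandomForestClassifier.fit` expects. The random 5x5 matrices and the alternating labels are made-up stand-ins for the real cutouts and targets, so this is an illustration under those assumptions rather than a fix tested on the actual files.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Stand-in for the 10 cutouts: each is a small 2D matrix (sizes here are made up)
sources = [rng.rand(5, 5) for _ in range(10)]
targets = np.array(["g", "b", "g", "b", "g", "b", "g", "b", "g", "b"])

# Flatten each 2D matrix into one row, giving shape (n_samples, n_features)
X = np.array([m.ravel() for m in sources])

# First 8 samples for training, remaining 2 for prediction
X_train, y_train = X[:8], targets[:8]
X_test = X[8:]

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print(X.shape, predictions.shape)
```

Flattening only works if every matrix has the same shape, since each sample must end up with the same number of features.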

Try writing:

X_train = sources.values[:8] # Inputs 
y_train = targets.values[:8] # Targets

I hope this solves your problem!
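As a small illustration of what `.values` returns for the targets: a one-column DataFrame yields a 2D column vector, whereas scikit-learn classifiers generally expect a 1D label array of shape (n_samples,). The label list below is a made-up stand-in for faketargets.dat, and the variable names are hypothetical.

```python
import pandas as pd

# Stand-in for faketargets.dat: one label per row, no header
targets = pd.DataFrame(["g", "b", "g", "b", "g", "b", "g", "b", "g", "b"])

y_column = targets.values[:8]        # 2D column vector of labels
y_flat = targets.values[:8].ravel()  # same labels as a 1D array
print(y_column.shape, y_flat.shape)
```

Passing the 1D form as `y_train` avoids the shape warning that scikit-learn typically raises for column-vector targets.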

Source

2017-07-20 10:34:15 Blessy
