2016-11-08 64 views

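I ran into this error while working on a project that predicts crime in San Francisco: ValueError: Found arrays with inconsistent numbers of samples: [878049 884262]

I get it when I try to run my knn classifier at the bottom of the code. I have been reading about it, and I understand it happens because my X and y do not have the same number of samples. X has shape (878049, 2) and y has shape (884262,).

How can I fix this error so that they match?

Code: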

# drop features that we won't be using
# train.head() 
df = train.drop(['Descript', 'Resolution', 'Address'], axis=1) 

df2 = test.drop(['Address'], axis=1) 

# trying to see the times during a day a particular crime occurs, for example 
# rapes occur more from 12am-4am during the weekend. 
# example below 
dow = { 
    'Monday':0, 
    'Tuesday':1, 
    'Wednesday':2, 
    'Thursday':3, 
    'Friday':4, 
    'Saturday':5, 
    'Sunday':6 
} 
df['DOW'] = df.DayOfWeek.map(dow) 

# Add column containing time of day 
df['Hour'] = pd.to_datetime(df.Dates).dt.hour 

# making my feature column 
feature_cols = ['DOW', 'Hour'] 
X = df[feature_cols] 

df2['DOW'] = df2.DayOfWeek.map(dow) 


y = df2['DOW'] 

# the number of rows (samples) in X and y doesn't match
print(X.shape) 
print(y.shape) 
print(y.head()) 
print(X.head()) 

# Knn classifier 
k = 5 
my_knn_for_cs4661 = KNeighborsClassifier(n_neighbors=k) 
my_knn_for_cs4661.fit(X, y) 

# KNN (with k=5), Decision Tree accuracy 
y_predict = my_knn_for_cs4661.predict(X) 
print('\n') 
score = accuracy_score(y, y_predict) 

print("K=",k,"Has ",score, "Accuracy") 
results = pd.DataFrame() 
results['actual'] = y 
results['prediction'] = y_predict 
print(results.head(10)) 

Stack trace:

--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-11-5a002c1fd668> in <module>() 
     7 k = 5 
     8 my_knn_for_cs4661 = KNeighborsClassifier(n_neighbors=k) 
----> 9 my_knn_for_cs4661.fit(X, y) 
    10 #KNN (with k=5), Decision Tree accuracy 
    11 y_predict = my_knn_for_cs4661.predict(X) 

C:\Users\Michael\Anaconda3\lib\site-packages\sklearn\neighbors\base.py in fit(self, X, y) 
    776   """ 
    777   if not isinstance(X, (KDTree, BallTree)): 
--> 778    X, y = check_X_y(X, y, "csr", multi_output=True) 
    779 
    780   if y.ndim == 1 or y.ndim == 2 and y.shape[1] == 1: 

C:\Users\Michael\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator) 
    518   y = y.astype(np.float64) 
    519 
--> 520  check_consistent_length(X, y) 
    521 
    522  return X, y 

C:\Users\Michael\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays) 
    174  if len(uniques) > 1: 
    175   raise ValueError("Found arrays with inconsistent numbers of samples: " 
--> 176       "%s" % str(uniques)) 
    177 
    178 

ValueError: Found arrays with inconsistent numbers of samples: [878049 884262] 

Can you add the stack trace? –


@SayaliSonawane OK, I've added it – lupejuares


Use X.shape and y.shape to check the shapes of X and y. The stack trace says X and y contain a different number of instances. –

Answer


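Check the shapes of X and y using X.shape and y.shape. The stack trace indicates that X and y contain different numbers of instances (samples). That is why the fit function throws a ValueError.

Refer to the documentation, which states: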
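A quick way to confirm the mismatch before calling fit is to compare the number of rows directly. The sketch below uses a hypothetical helper, `check_same_n_samples` (it is not part of scikit-learn), and small stand-in lists rather than the real data. It relies only on `len()`, which also returns the row count for a pandas DataFrame or Series.

```python
def check_same_n_samples(X, y):
    """Raise ValueError if X and y hold a different number of samples.

    Hypothetical helper for illustration; it only relies on len(), which
    returns the number of rows for lists, NumPy arrays and pandas objects.
    """
    n_X, n_y = len(X), len(y)
    if n_X != n_y:
        raise ValueError(
            "X has %d samples but y has %d samples" % (n_X, n_y)
        )
    return n_X


# Stand-in data: 3 feature rows, 4 target values (mismatched on purpose).
X_demo = [[0, 13], [1, 2], [6, 23]]
y_demo = [0, 1, 6, 3]

try:
    check_same_n_samples(X_demo, y_demo)
except ValueError as err:
    message = str(err)
    print(message)
```

Running a check like this right after building X and y points at the mismatch earlier than the traceback from inside fit does.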

"""Fit the model using X as training data and y as target values 
     Parameters 
     ---------- 
     X : {array-like, sparse matrix, BallTree, KDTree} 
      Training data. If array or matrix, shape [n_samples, n_features], 
      or [n_samples, n_samples] if metric='precomputed'. 
     y : {array-like, sparse matrix} 
      Target values, array of float values, shape = [n_samples] 
      or [n_samples, n_outputs] 
     """ 

Simply put,

X is (878049, 2) -> n_samples = 878049 and n_features = 2 
y is (884262,) -> Here, n_samples = 884262 

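You are passing extra target values, so reduce the number of target values in y. Since n_samples for X is 878049, you must pass the same number of target values (878049).

You can try: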

my_knn_for_cs4661.fit(X, y[:878049]) 
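Note that slicing only makes the lengths agree; it does not guarantee that row i of y describes the same record as row i of X. In the question, X is built from train while y is built from test, so one option (an assumption about the intent, not something the question confirms) is to take the target from the same DataFrame as the features. The sketch below uses a tiny made-up frame, and the `Category` target column is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for the training data; 'Category' is a hypothetical
# target column used only for illustration.
train_demo = pd.DataFrame({
    "DayOfWeek": ["Monday", "Sunday", "Friday", "Monday"],
    "Dates": ["2015-05-11 23:53:00", "2015-05-10 02:10:00",
              "2015-05-08 14:00:00", "2015-05-04 08:30:00"],
    "Category": ["THEFT", "ASSAULT", "THEFT", "FRAUD"],
})

dow = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
       "Friday": 4, "Saturday": 5, "Sunday": 6}

train_demo["DOW"] = train_demo["DayOfWeek"].map(dow)
train_demo["Hour"] = pd.to_datetime(train_demo["Dates"]).dt.hour

# Features and target come from the same frame, so they share one index
# and always have the same number of rows.
X_same = train_demo[["DOW", "Hour"]]
y_same = train_demo["Category"]

print(X_same.shape, y_same.shape)
```

Because both objects are selected from one frame, any later row filtering applied to that frame keeps them aligned as well.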

See also: sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

The accepted answer there states: "My input array's dimensions were skewed, as my input CSV had empty spaces."

Check your source files.
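One way to check a source file is to count blank or incomplete rows before loading it. This is a minimal sketch using only Python's standard `csv` module; the CSV text is made up for illustration, and the column names are assumed to mirror those used in the question.

```python
import csv
import io

# Made-up CSV text standing in for a real source file; the blank row and the
# row with a missing DayOfWeek are planted on purpose.
raw = (
    "Dates,DayOfWeek\n"
    "2015-05-11 23:53:00,Monday\n"
    "\n"
    "2015-05-10 02:10:00,\n"
    "2015-05-08 14:00:00,Friday\n"
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)

blank_rows = 0       # rows with no fields at all
incomplete_rows = 0  # rows with a missing or empty field
complete_rows = 0    # rows where every field is filled in

for row in reader:
    if not row:
        blank_rows += 1
    elif len(row) != len(header) or any(field.strip() == "" for field in row):
        incomplete_rows += 1
    else:
        complete_rows += 1

print(blank_rows, incomplete_rows, complete_rows)
```

For a real file, the same loop can read from `open(path, newline="")` instead of the in-memory string.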


Thanks for the explanation! – lupejuares