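scikit-learn error: The least populated class in y has only 1 member

I am trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I get this error: The least populated class in y has only 1 member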

In [1]: y.iloc[:,0].value_counts() 
Out[1]: 
M2 38 
M1 35 
M4 29 
M5 15 
M0 15 
M3 15 

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y) 
Out[2]: 
Traceback (most recent call last): 
    File "run_ok.py", line 48, in <module> 
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y) 
    File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split 
    train, test = next(cv.split(X=arrays[0], y=stratify)) 
    File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split 
    for train, test in self._iter_indices(X, y, groups): 
    File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices 
    raise ValueError("The least populated class in y has only 1" 
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

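However, every class has at least 15 samples. Why am I getting this error?

X is a pandas DataFrame holding the data points, and y is a one-column pandas DataFrame containing the target variable.

I cannot post the original data because it is proprietary, but the problem should be reproducible by creating a random pandas DataFrame (X) with 1k rows x 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) that holds the target variable (a categorical label) for each row. The y DataFrame should contain several different categorical labels (e.g. 'class1', 'class2', ...), each occurring at least 15 times.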
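The synthetic data described above could be built roughly as follows. This is only a sketch: the helper name `make_synthetic_data`, the column name `target`, the number of classes and the seed are assumptions chosen for illustration and do not come from the original data.

```python
import numpy as np
import pandas as pd


def make_synthetic_data(n_rows=1000, n_cols=500, n_classes=6, seed=0):
    """Build a random feature DataFrame X and a one-column label DataFrame y."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame(rng.normal(size=(n_rows, n_cols)))
    # Cycle through the class names so each class gets a roughly equal share
    # of the rows (far more than 15 each), then shuffle the label order.
    labels = np.array([f"class{i % n_classes}" for i in range(n_rows)])
    rng.shuffle(labels)
    y = pd.DataFrame({"target": labels})
    return X, y


X, y = make_synthetic_data()
print(X.shape, y.shape)
print(y.iloc[:, 0].value_counts())
```

Printing the shapes makes it easy to see that y is two-dimensional (rows x 1 column) rather than a flat array of labels.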

2017-04-03 Aurora

You should post a complete, reproducible code snippet that includes the error, a data sample, and the full stack trace.

The problem was that train_test_split takes 2 arrays as input, but my y array is a one-column matrix. If I pass only the first column of y, it works.

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
    random_state=85, stratify=y.iloc[:,0])
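A minimal, self-contained sketch of this fix on made-up data: the features, the column name `target` and the seed are assumptions, and only the class counts are taken from the question. Selecting the column with `.iloc[:, 0]` gives a 1-D Series, which is then used both as the target and as the `stratify` argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Class counts as reported in the question (147 rows in total).
counts = {"M2": 38, "M1": 35, "M4": 29, "M5": 15, "M0": 15, "M3": 15}
labels = np.repeat(list(counts), list(counts.values()))

X = pd.DataFrame(rng.normal(size=(len(labels), 5)))  # random stand-in features
y = pd.DataFrame({"target": labels})                 # one-column DataFrame, 2-D

y_1d = y.iloc[:, 0]                                  # first column as a 1-D Series
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y_1d, test_size=1/3, random_state=85, stratify=y_1d
)

print(y.shape, y_1d.shape)
print(ytrain.value_counts())
print(ytest.value_counts())
```

Comparing the two `value_counts()` outputs is a quick way to check that every class appears in both parts in roughly the original proportions.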

2017-04-03 09:36:11 Aurora
