Pandas indexing with [] throws an index out-of-bounds error, but .ix does not

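I have been trying to write a function that generates a stratified sample from a dataset (since sklearn has no such function), and I have come up with one.

The function below generates the indexes that I want to use to slice the original dataset, but for some reason, when it reaches the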

sampleData = dataset[indexes]

line, it throws an

IndexError: indices are out-of-bounds

error. However,

sampleData = dataset.ix[indexes]

works. Still, I have a feeling that this is wrong and is messing up my later processing. Does anyone have any ideas? :)

Here is the full code up to this point:

import numpy as np

def stratifiedSampleGenerator(dataset, target, subsample_size=0.1):
    print('Generating stratified sample of size ' + str(round(len(dataset) * subsample_size, 2)))
    dic = {}
    indexes = np.array([])
    # find how many rows to draw from each class
    for label in target.unique():
        labelSize = len(target[target == label])
        dic[label] = int(labelSize * subsample_size)
    # build a sample index with the per-class counts stored in dic
    for label in dic:
        classIndex = target[target == label].index  # index labels of this class
        counts = dic[label]  # number of rows to draw for this class
        newIndex = np.random.choice(classIndex, counts, replace=False)
        indexes = np.concatenate((indexes, newIndex), axis=0)

    indexes = indexes.astype(int)
    sampleData = dataset[indexes]  # throws error
    sampleData = dataset.ix[indexes]  # doesn't

Thanks! :)


2016-04-15 Wboy

Actually, sklearn does have a way to stratify a dataset.

Wouldn't something like this work in your case?

# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

dataset = ['A'] * 100 + ['B'] * 20 + ['C'] * 10
target = [0] * 100 + [1] * 20 + [2] * 10
X_fit, X_eval, y_fit, y_eval = train_test_split(dataset, target, test_size=0.1, stratify=target)
print(X_eval.count('A'))  # output: 10
print(X_eval.count('B'))  # output: 2
print(X_eval.count('C'))  # output: 1

Check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
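On the original error: assuming `dataset` is a pandas DataFrame (the question does not say so explicitly), `dataset[indexes]` looks the values up among the columns, not the rows, which is why it fails while row-based `.ix` appeared to work. `.ix` was deprecated in pandas 0.20 and removed in 1.0, so label-based `.loc` is the usual replacement. A minimal sketch with a made-up toy frame (the column names and index labels are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0..n-1.
dataset = pd.DataFrame(
    {"feature": [1.0, 2.0, 3.0, 4.0], "label": ["A", "A", "B", "B"]},
    index=[10, 11, 12, 13],
)
indexes = np.array([11, 13])

# dataset[indexes] treats the integers as column keys. No column matches,
# so it raises (IndexError in old pandas, KeyError in current releases).
try:
    dataset[indexes]
    bracket_failed = False
except (KeyError, IndexError):
    bracket_failed = True

# .loc selects rows by index label, which is what the sampler intends.
sample_data = dataset.loc[indexes]
print(bracket_failed)
print(sample_data)
```

If the generated `indexes` are positions rather than index labels, `dataset.iloc[indexes]` would be the matching choice instead.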


2016-04-15 15:00:37

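Hi, what I'm actually looking for is a stratified sample. If I'm not mistaken, the function in sklearn generates stratified folds over the whole dataset, so the size stays the same. For example, original dataset: 100 A, 20 B, 10 C; stratified sample: 10 A, 2 B, 1 C – Wboy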

In my example, `X_eval` and `y_eval` will contain a stratified subsample of size 0.1 * total_dataset_size. Isn't that what you want? –

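I just updated the example so you can run it right away. It also prints the output, so you can see that it takes the subsample size but preserves the proportions –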
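The same idea can be applied to a pandas DataFrame by keeping only the held-out part of the split. This is a rough sketch, assuming the 100 A / 20 B / 10 C class sizes from the comment above; the `feature` column, the variable names, and `random_state=0` are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data mirroring the 100 A / 20 B / 10 C example.
dataset = pd.DataFrame({"feature": range(130)})
target = pd.Series(["A"] * 100 + ["B"] * 20 + ["C"] * 10)

# Discard the large part and keep the 10% part as the stratified subsample.
_, sample_data, _, sample_target = train_test_split(
    dataset, target, test_size=0.1, stratify=target, random_state=0
)

# Per-class row counts in the subsample.
counts = sample_target.value_counts().to_dict()
print(counts)
```

The returned `sample_data` and `sample_target` keep their original index labels, so they stay aligned with each other and with the source frame.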
