Python: iterate over a DataFrame column, check condition values stored in an array, and collect the values into a list

After some help on the forum I managed to do what I was looking for, and now I need to take it to the next level. (The long explanation is here: Python Data Frame: cumulative sum of column until condition is reached and return the index)

I have a DataFrame:

In [3]: df 
Out[3]: 
    index Num_Albums Num_authors 
0  0   10   4 
1  1   1   5 
2  2   4   4 
3  3   7   1000 
4  4   1   44 
5  5   3   8

I add a column holding the cumulative sum of another column.

In [4]: df['cumsum'] = df['Num_Albums'].cumsum() 

In [5]: df 
Out[5]: 
    index Num_Albums Num_authors cumsum 
0  0   10   4  10 
1  1   1   5  11 
2  2   4   4  15 
3  3   7   1000  22 
4  4   1   44  23 
5  5   3   8  26

Then I apply a condition to the cumsum column and extract the rows whose values satisfy the condition within a given tolerance:

In [18]: tol = 2 

In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna() 

In [20]: cond 
Out[20]: 
    index Num_Albums Num_authors cumsum 
2 2.0   4.0   4.0 15.0
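
For anyone who wants to reproduce the steps above, here is a minimal, self-contained sketch. It rebuilds the example frame by hand (leaving out the 'index' column is my simplification) and applies the same tolerance filter. It also shows why the output above reads 15.0 rather than 15: `where` fills the non-matching rows with NaN before `dropna` removes them, so the integer columns are upcast to float.

```python
import pandas as pd

# Example data copied from the frame shown above (the 'index' column is left out).
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})
df["cumsum"] = df["Num_Albums"].cumsum()

tol = 2
target = 15  # the condition value used in the example

# Rows failing the mask become NaN, then dropna() removes them.
in_range = (df["cumsum"] >= target - tol) & (df["cumsum"] <= target + tol)
cond = df.where(in_range).dropna()

print(cond)
print(cond.dtypes)  # all columns are float64 after the NaN fill
```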

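What I want to do now is replace the condition 15 in the example with conditions stored in an array, check whether each condition is satisfied, and retrieve not the whole row but only the value of the Num_Albums column. Finally, all of these retrieved values (one per condition) should be stored in an array or list. Coming from MATLAB, I would do something like this (apologies for the mixed MATLAB/Python syntax):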

conditions = np.array([10, 15, 23]) 
for i=0:len(conditions) 
    retrieved_values(i) = df.where((df['cumsum']>=conditions(i)-tol)&(df['cumsum']<=conditions(i)+tol)).dropna()

So for the DataFrame above I would get (for tol=0):

retrieved_values = [10, 4, 1]

I would like a solution that lets me keep the .where function, if possible.

Source

2017-01-09 AMaz

So the output is not always a single number, right? If the output is exactly one number per condition, you can write the code like this

tol = 0
# condition values to look for
c = [5, 15, 25]
value = []

for i in c:
    # values of column 'a' that lie within tol of the condition
    matches = df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a']
    if len(matches) > 0:
        value.append(matches.values[0])
    else:
        # no row matched: store an empty placeholder instead of raising
        value.append([])
print(value)

The output should look like

[1,2,3]

If the output can contain several numbers per condition and you want it to look like this

[[1.0, 5.0], [12.0, 15.0], [25.0]]

you can use this code

tol = 5 
c = [5,15,25] 
value = [] 

for i in c: 
    getdatas = df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a'].values 
    value.append([x for x in getdatas]) 
print(value)
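
The answer does not show the data behind column 'a', so the following is only a sketch under an assumption: the values `[1, 5, 12, 15, 25]` are hypothetical, chosen by me because they give the nested list shown above. Everything else follows the loop in the answer, with `tolist()` used so the result holds plain Python floats.

```python
import pandas as pd

# Hypothetical data: the original post does not show column 'a'.
df = pd.DataFrame({"a": [1, 5, 12, 15, 25]})

tol = 5
c = [5, 15, 25]
value = []

for i in c:
    # Keep only the entries of 'a' within tol of the condition i.
    matches = df.where((df["a"] >= i - tol) & (df["a"] <= i + tol)).dropna()["a"]
    value.append(matches.tolist())

print(value)  # [[1.0, 5.0], [12.0, 15.0], [25.0]]
```

The values come back as floats because `where` inserts NaN for the rejected rows before `dropna` discards them.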

Source

2017-01-09 10:53:56

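I keep getting: IndexError: index 0 is out of bounds for axis 0 with size 0 – AMaz

@Amaz Is that with the first option or the second? The first would raise an IndexError because it uses .values[0], which needs to be validated beforehand. Let me edit it for you –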

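A fast approach is to use NumPy broadcasting, as an extension of this answer, even though the question actually asked for a solution that uses DF.where.

Broadcasting removes the need to iterate over each element of the array, and it is efficient at the same time.

The only addition to that post is the use of np.argmax to get the index of the first True instance along each column (traversing in the ↓ direction).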

conditions = np.array([10, 15, 23]) 
tol = 0 
num_albums = df.Num_Albums.values 
num_albums_cumsum = df.Num_Albums.cumsum().values 
slices = np.argmax(np.isclose(num_albums_cumsum[:, None], conditions, atol=tol), axis=0)

The retrieved slices:

slices 
Out[692]: 
array([0, 2, 4], dtype=int64)

which produce the corresponding array:

num_albums[slices] 
Out[693]: 
array([10, 4, 1], dtype=int64)
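
To make the broadcasting step less opaque, this sketch builds the intermediate boolean mask for the example data and prints its shape. The array values are copied from the question; the variable name `mask` is mine. I also pass `rtol=0` explicitly, which is an addition not in the answer above, so that `np.isclose` compares with the absolute tolerance only.

```python
import numpy as np

num_albums = np.array([10, 1, 4, 7, 1, 3])
num_albums_cumsum = num_albums.cumsum()      # [10 11 15 22 23 26]
conditions = np.array([10, 15, 23])
tol = 0

# Shape (6, 1) against shape (3,) broadcasts to a (6, 3) mask:
# one row per cumsum entry, one column per condition.
mask = np.isclose(num_albums_cumsum[:, None], conditions, rtol=0, atol=tol)

# argmax along axis 0 returns the row of the first True in each column.
slices = np.argmax(mask, axis=0)

print(mask.shape)          # (6, 3)
print(slices)              # [0 2 4]
print(num_albums[slices])  # [10  4  1]
```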

If you still prefer to use DF.where, here is another solution using a list comprehension -

[df.where((df['cumsum'] >= cond - tol) & (df['cumsum'] <= cond + tol), -1)['Num_Albums'] 
    .max() for cond in conditions] 
Out[695]: 
[10, 4, 1]

Rows that do not satisfy the given condition are replaced with -1. Doing it this way preserves the dtype.
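
Two properties of the -1 approach are worth seeing in a small run. This is a sketch on the question's data, with a third condition (100) that I added myself because it is never reached. A condition with no matching row yields -1, which only works as a sentinel because album counts are non-negative. Also, with a larger tolerance several rows can match, and `.max()` then returns the largest Num_Albums among them rather than the first.

```python
import pandas as pd

df = pd.DataFrame({"Num_Albums": [10, 1, 4, 7, 1, 3]})
df["cumsum"] = df["Num_Albums"].cumsum()

conditions = [10, 15, 100]   # 100 is never reached by the cumulative sum
tol = 0

retrieved = [
    # Non-matching rows are overwritten with -1, so max() picks the match.
    df.where((df["cumsum"] >= c - tol) & (df["cumsum"] <= c + tol), -1)["Num_Albums"].max()
    for c in conditions
]

# Convert NumPy integers to plain ints for a clean printout.
print([int(v) for v in retrieved])  # [10, 4, -1]
```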

Source

2017-01-09 12:10:19

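I actually prefer the first option. I'm not sure the use of 'None' is clear to me. Applying your suggestion, when a condition is not met the value in 'slices' is 0. When I then call 'num_albums[slices]' I get the first value (at index 0) even though the condition was not met. How can I make 'slices' be NaN when the condition is not met? – AMaz

'None' here means 'np.newaxis', which simply reshapes the array by inserting an extra dimension, letting us compare the array across more dimensions (a 2D array here). 'num_albums_cumsum.reshape(-1, 1)' would serve the same purpose. No, 'num_albums[slices]' gives you the values where the condition is met. If you want 'NaN' to appear for the 'False' conditions, I would suggest looking at 'np.where'. But I don't see the point of it, since you just want to collect them in a list/array. –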
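
One possible way to follow the 'np.where' suggestion is sketched below. It is not code from the thread: the condition 100 and the names `mask` and `found` are my own, and `rtol=0` is added so only the absolute tolerance applies. Since `np.argmax` returns 0 for a column that contains no True at all, the sketch checks each column with `any` and writes NaN where nothing matched.

```python
import numpy as np

num_albums = np.array([10, 1, 4, 7, 1, 3])
num_albums_cumsum = num_albums.cumsum()
conditions = np.array([10, 15, 100])   # 100 has no match in the cumsum
tol = 0

mask = np.isclose(num_albums_cumsum[:, None], conditions, rtol=0, atol=tol)
slices = np.argmax(mask, axis=0)   # 0 for an all-False column, so not enough alone
found = mask.any(axis=0)           # False where a condition was never met

# Cast to float so NaN can be stored, then blank out the unmatched conditions.
retrieved = np.where(found, num_albums[slices].astype(float), np.nan)
print(retrieved)  # [10.  4. nan]
```

The result has to be a float array, because NaN cannot be stored in an integer array.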
