2016-08-16 58 views
1

Conditional iteration over a dataframe

I have a dataframe df that looks like:

              id location   grain
0  BBG.XETR.AD.S     XETR  16.545
1  BBG.XLON.VB.S     XLON  6.2154
2  BBG.XLON.HF.S     XLON     NaN
3  BBG.XLON.RE.S     XLON     NaN
4  BBG.XLON.LL.S     XLON     NaN
5  BBG.XLON.AN.S     XLON   3.215
6  BBG.XLON.TR.S     XLON     NaN
7  BBG.XLON.VO.S     XLON     NaN

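In reality this dataframe will be much larger. I want to iterate over it and return the 'grain' values, but I am only interested in rows that have a value (not NaN) in the 'grain' column. So, as I iterate over the dataframe, only the following values should be returned: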

16.545 
6.2154 
3.215 

I can iterate over the dataframe using:

for staticidx, row in df.iterrows():
    value = row['grain']

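But this returns the value for every row, including those where it is NaN. Is there a way to either remove the NaN rows from the dataframe, or skip the rows where grain equals NaN?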

Many thanks

+1

`df.grain[~pd.isnull(df.grain)]`? – Psidom

+1

Or you can do it this way: `df.ix[df.grain.notnull(), 'grain']` – MaxU
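
The `.ix` indexer used in the comment above was deprecated in pandas 0.20 and removed in pandas 1.0. A minimal sketch of the same selection with `.loc`, assuming a recent pandas version and a small frame rebuilt from the first three rows of the question's sample data:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (first three rows only).
df = pd.DataFrame({
    "id": ["BBG.XETR.AD.S", "BBG.XLON.VB.S", "BBG.XLON.HF.S"],
    "location": ["XETR", "XLON", "XLON"],
    "grain": [16.545, 6.2154, np.nan],
})

# Boolean mask: True where 'grain' holds a value, False where it is NaN.
mask = df["grain"].notnull()

# .loc takes the mask as the row selector and a column label.
grains = df.loc[mask, "grain"]
print(grains.tolist())
```

The result is a Series that keeps the original index labels, so the surviving rows can still be matched back to the full frame.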

Answers

1

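You can pass a list of columns to dropna so that only those columns are checked for missing values:

subset : array-like. Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include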

>>> df.dropna(subset=['grain']) 
              id location    grain
0  BBG.XETR.AD.S     XETR  16.5450
1  BBG.XLON.VB.S     XLON   6.2154
5  BBG.XLON.AN.S     XLON   3.2150
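
To connect this back to the loop in the question, here is a minimal sketch that iterates only over the rows left after `dropna(subset=['grain'])`. The frame is rebuilt from the question's sample data, and the `values` list is just an illustrative way to collect the results:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "id": ["BBG.XETR.AD.S", "BBG.XLON.VB.S", "BBG.XLON.HF.S", "BBG.XLON.RE.S",
           "BBG.XLON.LL.S", "BBG.XLON.AN.S", "BBG.XLON.TR.S", "BBG.XLON.VO.S"],
    "location": ["XETR"] + ["XLON"] * 7,
    "grain": [16.545, 6.2154, np.nan, np.nan, np.nan, 3.215, np.nan, np.nan],
})

# Drop rows whose 'grain' is NaN, then iterate over what is left.
values = []
for staticidx, row in df.dropna(subset=["grain"]).iterrows():
    values.append(row["grain"])

print(values)
```

`dropna` returns a new frame here, so the original `df` still contains the NaN rows afterwards.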
0

This:

df[pd.notnull(df['grain'])] 

Or this:

df['grain'].dropna()
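
If only the 'grain' values are needed, the Series returned by `dropna()` can be iterated directly, without `iterrows()`. A minimal sketch, assuming a frame holding just the question's 'grain' column; the `pairs` list is a hypothetical name used for illustration:

```python
import numpy as np
import pandas as pd

# Only the 'grain' column from the question's sample data.
df = pd.DataFrame({
    "grain": [16.545, 6.2154, np.nan, np.nan, np.nan, 3.215, np.nan, np.nan],
})

# Series.items() yields (index label, value) pairs for the non-NaN entries.
pairs = []
for idx, value in df["grain"].dropna().items():
    pairs.append((idx, value))

print(pairs)
```

The index labels are preserved, so each value can still be traced back to its row in the original frame.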
0

Let's compare the different methods (on an 800K-row DF):

In [21]: df = pd.concat([df] * 10**5, ignore_index=True) 

In [22]: df.shape 
Out[22]: (800000, 3) 

In [23]: %timeit df.grain[~pd.isnull(df.grain)] 
The slowest run took 5.33 times longer than the fastest. This could mean that an intermediate result is being cached. 
100 loops, best of 3: 17.1 ms per loop 

In [24]: %timeit df.ix[df.grain.notnull(), 'grain'] 
10 loops, best of 3: 23.9 ms per loop 

In [25]: %timeit df[pd.notnull(df['grain'])] 
10 loops, best of 3: 35.9 ms per loop 

In [26]: %timeit df.grain.ix[df.grain.notnull()] 
100 loops, best of 3: 17.4 ms per loop 

In [27]: %timeit df.dropna(subset=['grain']) 
10 loops, best of 3: 56.6 ms per loop 

In [28]: %timeit df.grain[df.grain.notnull()] 
100 loops, best of 3: 17 ms per loop 

In [30]: %timeit df['grain'].dropna() 
100 loops, best of 3: 16.3 ms per loop
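
Outside IPython, a comparison like the one above can be approximated with the standard `timeit` module. This is only a rough sketch under stated assumptions: it uses a smaller repeat factor than the answer (10**3 instead of 10**5) to keep the run short, includes only the 'grain' column, and leaves out the `.ix` variants because `.ix` no longer exists in current pandas. The labels in `candidates` are made up for illustration, and timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Eight-row 'grain' column from the question, repeated to build a larger frame.
base = pd.DataFrame({
    "grain": [16.545, 6.2154, np.nan, np.nan, np.nan, 3.215, np.nan, np.nan],
})
df = pd.concat([base] * 10**3, ignore_index=True)  # assumption: smaller than 800K rows

# The approaches from the thread, wrapped as zero-argument callables.
candidates = {
    "series mask, ~isnull": lambda: df.grain[~pd.isnull(df.grain)],
    "series mask, notnull": lambda: df.grain[df.grain.notnull()],
    "frame mask, notnull": lambda: df[pd.notnull(df["grain"])],
    "frame dropna(subset)": lambda: df.dropna(subset=["grain"]),
    "series dropna": lambda: df["grain"].dropna(),
}

runs = 20
timings = {}
for name, func in candidates.items():
    # Average wall-clock time per call over `runs` calls.
    timings[name] = timeit.timeit(func, number=runs) / runs
    print(f"{name}: {timings[name] * 1e3:.3f} ms per call")
```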