How do I read two rows and two columns of a pandas DataFrame at a time and apply a condition to those row/column values?

I want to read two rows and two columns of a pandas DataFrame at a time, and then apply either zip or product to that two-row/two-column block, depending on a condition.

import pandas as pd 
import itertools as it 
from itertools import product 

cond_mcve = pd.read_csv('condition01.mcve.txt', sep='\t') 

  alfa  alfa_index beta  beta_index delta  delta_index
0  a,b          23  c,d          36   a,c           32
1  a,c          23  b,e          37   c,d           32
2  g,h          28  d,f          37   e,g           32
3  a,b          28  c,d          39   a,c           34
4  c,e          28  b,g          39   d,k           34

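Here alfa, beta and delta are string values, and each has its own corresponding index column.
I want to zip the strings of two adjacent rows if they have the same index value. So for the first two rows of the alfa column the output should be aa,cb, because alfa_index is 23 in both rows.
However, for the second and third rows of the alfa column the two index values differ (23 and 28), so the output should be the product of the strings, i.e. ga,gc,ha,hc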
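To make the two cases concrete, here is a minimal sketch of the rule for a single pair of strings. The helper name `combine` and its `same_index` parameter are made up for illustration; the later row is passed first, which is the order the expected strings above imply.

```python
from itertools import product

def combine(current, previous, same_index):
    """Join two comma-separated strings element-wise (zip) or as a cross product."""
    x = current.split(',')
    y = previous.split(',')
    # zip pairs items position by position; product pairs every item with every item
    pairs = zip(x, y) if same_index else product(x, y)
    return ','.join(a + b for a, b in pairs)

# rows 1 and 0 of alfa share index 23, so the strings are zipped
print(combine('a,c', 'a,b', same_index=True))    # aa,cb
# rows 2 and 1 of alfa have indexes 28 and 23, so the product is taken
print(combine('g,h', 'a,c', same_index=False))   # ga,gc,ha,hc
```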

This is roughly how I pictured doing it (I hope I have explained the problem clearly enough):

# write a function (pseudocode)
def some_function():
    # read two columns at once (paired by shared prefix, e.g. alfa / alfa_index)
    #
    # if the two integer index values are the same:
    #     zip(the strings belonging to those indexes)
    #
    # if the two integer index values are different:
    #     product(the strings belonging to those indexes)
    ...

# take this function and apply it to the pandas DataFrame:
cond_mcve_updated = cond_mcve+cond_mcve.shift(1).dropna(how='all').applymap(some_function)

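Here shift lets me read two rows at a time, so that part of the problem is solved. However, I still have trouble reading two columns and implementing the condition:

Reading two columns of the pandas DataFrame at once (paired by prefix similarity).
Separating out the index values (integers) of those columns for comparison.
Applying zip vs. product based on that condition.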
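The first two of these steps can be sketched as follows. This is only an illustration on a cut-down copy of the data, and it assumes every value column has a companion named `<column>_index`; the names `pairs`, `prev` and `same` are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'alfa': ['a,b', 'a,c', 'g,h'],
    'alfa_index': [23, 23, 28],
    'beta': ['c,d', 'b,e', 'd,f'],
    'beta_index': [36, 37, 37],
})

# Pair every value column with its "<name>_index" companion.
pairs = [(c, c + '_index') for c in df.columns if not c.endswith('_index')]

# Row n-1 aligned with row n; the first row becomes NaN.
prev = df.shift(1)

# True where a row's index value equals the previous row's index value.
same = pd.DataFrame({c: df[i].eq(prev[i]) for c, i in pairs})

print(pairs)
print(same)
```

Where `same` is True the strings would be zipped, and where it is False (apart from the first row, which has no predecessor) the product would be taken.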

The expected final output would be:

  alfa         alfa_index  beta         beta_index  delta  delta_index
1 aa,cb        23          bc,bd,ec,ed  37          ca,dc  32
2 ga,gc,ha,hc  28          db,fe        37          ec,gd  32
same for the other lines.....

# the first index (i.e. 0) is lost, but that's OK. I can work around it using the `head`/`tail` methods in pandas.

Source

2017-03-15 everestial007

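Here is one way to achieve the result. This function uses shift, concat and apply to run the data through a function that does the product/zip step depending on whether the index values match.

Code: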

import itertools as it
import pandas as pd

def crazy_prod_sum_thing(frame):
    # get the labels which do not end with _index
    labels = [(l, l + '_index')
              for l in frame.columns.values if not l.endswith('_index')]

    def func(row):
        # split the concatenated row into row n (front) and row n-1 (back)
        half = len(row) >> 1
        front = row.iloc[:half]
        back = row.iloc[half:]

        # loop through the labels
        results = []
        for l, i in labels:
            x = front[l].split(',')
            y = back[l].split(',')
            if front[i] == back[i]:
                # same index value: pair the strings element-wise (zip)
                results.append(x[0] + y[0] + ',' + x[1] + y[1])
            else:
                # different index values: cross product of the strings
                results.append(
                    ','.join([x1 + y1 for x1, y1 in it.product(x, y)]))

        return pd.Series(results)

    # pair each row with the previous one, drop row 0, then apply func row-wise
    df = pd.concat([frame, frame.shift(1)], axis=1)[1:].apply(
        func, axis=1)

    df.rename(columns={i: x[0] + '_cpst' for i, x in enumerate(labels)},
              inplace=True)
    return pd.concat([frame, df], axis=1)

Test code:

import pandas as pd
from io import StringIO
data = [x.strip() for x in """
     alfa alfa_index beta beta_index delta delta_index
    0 a,b   23 c,d   36 a,c   32
    1 a,c   23 b,e   37 c,d   32
    2 g,h   28 d,f   37 e,g   32
    3 a,b   28 c,d   39 a,c   34
    4 c,e   28 b,g   39 d,k   34
""".split('\n')[1:-1]]
df = pd.read_csv(StringIO(u'\n'.join(data)), sep=r'\s+')
print(df)

print(crazy_prod_sum_thing(df))

Results (the input frame, followed by the combined alfa, beta and delta strings for rows 1 to 4):

  alfa  alfa_index beta  beta_index delta  delta_index
0  a,b          23  c,d          36   a,c           32
1  a,c          23  b,e          37   c,d           32
2  g,h          28  d,f          37   e,g           32
3  a,b          28  c,d          39   a,c           34
4  c,e          28  b,g          39   d,k           34

1    [aa,cb, bc,bd,ec,ed, ca,dc]
2    [ga,gc,ha,hc, db,fe, ec,gd]
3    [ag,bh, cd,cf,dd,df, ae,ag,ce,cg]
4    [ca,eb, bc,gd, da,kc]

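Note:

This does not marshal the results back into the data frame as laid out in the question, because I was not sure which index values to take when the index values of the two rows do not match.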
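For what it's worth, here is a rough sketch of how the combined strings could be written back into the frame. It assumes the index value to keep is always the current row's, which is what the expected output in the question appears to do (row 1 shows beta_index 37, not 36). `combine_in_place` is a hypothetical helper, not part of the function above, and the frame is a cut-down copy of the sample data.

```python
import itertools as it
import pandas as pd

def combine_in_place(frame):
    """Hypothetical helper: overwrite each value column with its combined string."""
    prev = frame.shift(1)
    out = frame.iloc[1:].copy()  # row 0 has no predecessor, so it is dropped
    labels = [c for c in frame.columns if not c.endswith('_index')]
    for col in labels:
        idx = col + '_index'
        combined = []
        for n in out.index:
            x = frame.at[n, col].split(',')
            y = prev.at[n, col].split(',')
            # zip when the index values match, product otherwise
            same = frame.at[n, idx] == prev.at[n, idx]
            pairs = zip(x, y) if same else it.product(x, y)
            combined.append(','.join(a + b for a, b in pairs))
        out[col] = combined  # the *_index columns keep the current row's value
    return out

df = pd.DataFrame({
    'alfa': ['a,b', 'a,c', 'g,h'], 'alfa_index': [23, 23, 28],
    'beta': ['c,d', 'b,e', 'd,f'], 'beta_index': [36, 37, 37],
})
result = combine_in_place(df)
print(result)
```

If a different rule is wanted for mismatched index values (for example keeping both), only the handling of the `*_index` columns would need to change.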

Source

2017-03-15 20:26:01

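This must be workable. If there is a way to preserve the index values, I'll try to work it out. Thank you very much. I haven't accepted the answer yet, because I'd also like to leave room for other answers. I'll wait a few days to see whether the question gets some more attention. Thanks. – everestial007

I just tried the code, but got an error at print(crazy_prod_sum_thing(df)). Error message: TypeError: ('cannot do ... with these indexers [6.0] ...', 'occurred at index 1'). It hints at something about 'float', but the index values should be integers. What could be the problem? – everestial007

Tried both. I also printed the output of the file you created and of the one I posted. Both are exactly the same and of the same type. Both give exactly the same error message. idk why? – everestial007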
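The `[6.0]` in that message suggests a float ended up where an integer slice position was expected. One plausible cause, which the thread does not confirm, is halving the length of the 12-element concatenated row with `/`: on Python 3 that yields a float, whereas `//` and `>> 1` stay integers. The snippet below only illustrates that arithmetic, plus the unrelated fact that `shift(1)` turns integer columns into floats; the variable names are made up.

```python
import pandas as pd

n = 12                  # length of a concatenated row: 6 columns, twice
print(n / 2)            # true division gives a float on Python 3
print(n // 2, n >> 1)   # floor division and the bit shift both stay integers

s = pd.Series([23, 23, 28])
shifted = s.shift(1)
# shifting introduces NaN, so the integer column becomes float
print(s.dtype, shifted.dtype)
# equality against the original still behaves as expected
print(s.eq(shifted).tolist())
```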
