Split a column of a DataFrame into new columns and filter on the values

I have a DataFrame that looks like this:

Head1 Header2 
ABC SAP (+115590), GRN (+426250)  
EFG HES3 (-6350), CMT (-1902) 
HIJ CORT (-19440), API (+177) 
KLM AAD (-25488), DH(-1341) ,DSQ(+120001) 
SOS MFA (-11174), 13A2 (+19763)

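I need to split the second column on commas and create new columns in the same DataFrame. In addition, I need to pull out all the values inside the parentheses and use that numeric information to create another column for further filtering.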
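To make the goal concrete, here is a minimal sketch of how a single cell could be parsed with the standard `re` module. The helper name `parse_cell` and the pattern `ENTRY` are hypothetical and not part of the original post; the sketch assumes every entry has the form `NAME (±NUMBER)`.

```python
import re

# Hypothetical pattern (an assumption, not from the post): a name, optional
# whitespace, then a signed integer in parentheses, e.g. "DH(-1341)".
ENTRY = re.compile(r"(\w+)\s*\(([+-])(\d+)\)")

def parse_cell(cell):
    """Return a list of (name, sign, distance) tuples found in one cell."""
    return [(name, sign, int(num)) for name, sign, num in ENTRY.findall(cell)]

row = "AAD (-25488), DH(-1341) ,DSQ(+120001)"
parsed = parse_cell(row)
print(parsed)

# Names whose distance (ignoring the sign) is at least 100000
print([name for name, sign, num in parsed if num >= 100000])
```

Because `findall` returns every match, the number of comma-separated entries in a cell does not need to be known in advance.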

So far I can do this with some not-so-elegant code, which is rather long:

    import pandas as pd

    Trans = 'file.txt'
    Trans = pd.read_csv(Trans, sep="\t", header=0)
    Trans.columns = ["RNA", "PCs"]

    # Here I changed the dtype to string to do the split
    Trans.PCs = Trans.PCs.astype(str)
    # I took the first part of the second column out into a new column, PC1
    Trans["PC1"] = Trans.PCs.str.extract(r'(\w*)', expand=False)
    # Here I split the numeric information from the first part
    Trans[['Strand1', 'Dis1']] = Trans.PCs.str.extract(r'([+-])(\d*)', expand=True)
    Trans.head()


    Head  Header2                                  Head1  Strand1  Dis1
    ABC   SAP (+115590), GRN (+426250)             SAP    +        115590
    EFG   HES3 (-6350), CMT (-1902)                HES3   -        6350
    HIJ   CORT (-19440), API (+177)                CORT   -        19440
    KLM   AAD (-25488), DH(-1341) ,DSQ(+120001)    AAD    -        25488
    SOS   MFA (-11174), 13A2 (+19763)              MFA    -        11174

I need to split the DataFrame above further, so I used the following code for the second part of column 2:

    # This is for the second part of the 2nd column
    Trans["PC2"] = Trans.PCs.str.split(',').str.get(1)
    # Did the same for the numeric information
    Trans[['Strand2', 'Dis2']] = Trans.PC2.str.extract(r'([+-])(\d*)', expand=True)
    # Drop the parenthesised part so that only the name is left
    Trans['PC2'] = Trans.PC2.str.replace(r"\(.*\)", "", regex=True).str.strip()

    # At this point the DataFrame looks like this:
    Head  Header2                                  Head1  Strand1  Dis1    Head2  Strand2  Dis2
    ABC   SAP (+115590), GRN (+426250)             SAP    +        115590  GRN    +        426250
    EFG   HES3 (-6350), CMT (-1902)                HES3   -        6350    CMT    -        1902
    HIJ   CORT (-19440), API (+177)                CORT   -        19440   API    +        177
    KLM   AAD (-25488), DH(-1341) ,DSQ(+120001)    AAD    -        25488   DH     -        1341
    SOS   MFA (-11174), 13A2 (+19763)              MFA    -        11174   13A2   +        19763

    Trans = Trans.fillna(0)
    Trans.Dis1 = Trans.Dis1.astype(int)
    Trans.Dis2 = Trans.Dis2.astype(int)

    # Here I am filtering the rows based on the Dis1 and Dis2 columns of the DataFrame
    # (the first column was renamed to "RNA" above, so the slice starts there)
    Trans_Pc1 = Trans.loc[:, "RNA":"Dis1"].query('Dis1 >= 100000')
    Trans_Pc2 = Trans.loc[:, "PC2":"Dis2"].query('Dis2 >= 100000')
    TransPC1 = Trans_Pc1.PC1
    TransPC2 = Trans_Pc2.PC2
    TransPCs = pd.concat([TransPC1, TransPC2])

The result looks like this:

Header 
SAP 
GRN 
DSQ

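The script is long but it works. However, I run into a problem when the second column has rows with more than two comma-separated values, like this one: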

KLM AAD (-25488), DH(-1341) ,DSQ(+120001)

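It has three comma-separated values. I know I would have to repeat the split again, but my DataFrame is very large and has many rows with differing numbers of comma-separated values. For example, some rows have 2 comma-separated values in column 2, some have 5, and so on.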

Any better way to filter my DataFrame would be great. Ultimately, I am aiming for a DataFrame like this:

header 
SAP 
GRN 
DSQ

Any help or suggestions would be greatly appreciated.

Source

2016-07-07 user1017373

Try:

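    import pandas as pd

    df = pd.DataFrame(
        [
            ['ABC', 'SAP (+115590), GRN (+426250)'],
            ['EFG', 'HES3 (-6350), CMT (-1902)'],
            ['HIJ', 'CORT (-19440), API (+177)'],
            ['KLM', 'AAD (-25488), DH(-1341) ,DSQ(+120001)'],
            ['SOS', 'MFA (-11174), 13A2 (+19763)'],
        ], columns=['Head1', 'Header2'])

    # Split Header2 on commas: one column per comma-separated piece
    df1 = df.Header2.str.split(',', expand=True)

    # Named groups: the name, the sign, and whatever follows the sign inside the parentheses
    regex = r'(?P<Head>\w+).*\((?P<Strand>[+-])(?P<Dis>.*)\)'
    # For a one-row group, run the regex over that row's pieces
    extract = lambda df: df.iloc[0].str.extract(regex, expand=True)

    extracted = df1.groupby(level=0).apply(extract)

    # Reshape so that each (group name, piece number) pair becomes a column
    df2 = extracted.stack().unstack([2, 1])

    # Flatten the two-level column labels into names such as Head0, Strand0, Dis0
    colseries = df2.columns.to_series()
    df2.columns = colseries.str.get(0).astype(str) + colseries.str.get(1).astype(str)

    pd.concat([df, df2], axis=1)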

Source

2016-07-07 10:59:42 piRSquared

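Thank you for the simple approach, but as you can see in the post, I need to filter further on the Dis* columns and return the rows where Dis* >= 100000. What I mean is: what do I do when there are many Dis* columns? This is what I tried when I had only two Dis columns: Trans_Pc1 = Trans.loc[:, "Head1":"Dis1"].query('Dis1 >= 100000') and Trans_Pc2 = Trans.loc[:, "Head2":"Dis2"].query('Dis2 >= 100000') – user1017373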
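One possible way to handle the follow-up, sketched here rather than taken from the answer above, is to avoid the wide Dis1, Dis2, ... columns and keep one row per entry using `Series.str.extractall`. A single filter then covers any number of comma-separated values. The variable names `entries` and `hits` and the pattern are assumptions made for illustration, and the comparison ignores the sign, as the code in the question does.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["ABC", "SAP (+115590), GRN (+426250)"],
        ["EFG", "HES3 (-6350), CMT (-1902)"],
        ["HIJ", "CORT (-19440), API (+177)"],
        ["KLM", "AAD (-25488), DH(-1341) ,DSQ(+120001)"],
        ["SOS", "MFA (-11174), 13A2 (+19763)"],
    ],
    columns=["Head1", "Header2"],
)

# extractall yields one row per match, indexed by (original row, match number),
# so cells with 2, 3 or 5 entries are all handled the same way.
pattern = r"(?P<Head>\w+)\s*\((?P<Strand>[+-])(?P<Dis>\d+)\)"
entries = df["Header2"].str.extractall(pattern)
entries["Dis"] = entries["Dis"].astype(int)

# Keep the names whose distance is at least 100000.
hits = entries.loc[entries["Dis"] >= 100000, "Head"].reset_index(drop=True)
print(hits.tolist())  # ['SAP', 'GRN', 'DSQ']
```

If the originating row is needed as well, the first level of the `entries` index still holds the row label of `df`, so the result can be joined back on it.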
