2017-04-12 48 views

I want to pd.merge nearly 10 files, merging on three columns across all of them. Each file has data like this:

chrom start end name score strand splice_site acceptors_skipped exons_skipped donors_skipped anchor known_donor known_acceptor known_junction genes transcripts 
4  3487839 3491240 JUNC00148541 101 - GT-AG 2 1 3 DA 1 1 1 Tmem68 ENSMUST00000029891,ENSMUST00000108388,ENSMUST00000154922 
4  3489293 3491240 JUNC00148543 1 - GT-AG 1 0 1 DA 1 1 1 Tmem68 ENSMUST00000029891,ENSMUST00000108388,ENSMUST00000154922 

I have used merge in the past with pd.merge(df_a, df_b, on='gene', how='outer'), merging on a single column only. Here I want to merge on chrom, start, end and strand.

What my new DF should look like:

chrm:start-end(strand) score_file1 score_file2 ...file10 gene_name splice_site acceptores exon_skipped donors_skipped...transcripts 

Where there is no match, I believe how='outer' will fill in NaN values. What is the best way to do this while keeping memory usage down?

import glob
import os

import pandas as pd

path = r'/Users/PycharmProjects/'
all_files = glob.glob(os.path.join(path, "*_bed.txt"))
print(all_files)
# read_table expects tab-separated input by default
df1 = pd.read_table(all_files[0])
df2 = pd.read_table(all_files[1])

concatenated_df = pd.merge(df1, df2, on=['genes', 'chrom', 'start', 'end'], how='outer')
print(concatenated_df.head(n=5))

Any help is appreciated!

Update, simplified question:

chr start end score strand gene 
1 20 30 50 -  abc1 
2 40 50 50 +  cdf1 

I have 10 CSV files with data like this and want to merge them on an exact match of chr, start, end and gene. The new DF:

chr start end score_file1 score_file2..file10 strand gene 
1 20 30 50 20 40 -  abc1 
2 40 50 50 30 50 +  cdf1 

You can pass the columns as a list: pd.merge(df_a, df_b, on=['gene', 'chrom', 'start'], how='outer') – Shahram


I guess so, and it works, but the problem is that it appends the entire header as well: chrom \t start \t end \t name_x \t score_x \t strand_x \t splice_site_x \t acceptors_skipped_x \t exons_skipped_x \t donors_skipped_x \t anchor_x \t known_donor_x \t known_acceptor_x \t known_junction_x \t genes \t transcripts_x \t name_y \t score_y \t strand_y \t splice_site_y \t acceptors_skipped_y \t exons_skipped_y \t donors_skipped_y \t anchor_y \t known_donor_y \t known_acceptor_y \t known_junction_y \t transcripts_y – sbradbio
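The _x and _y endings in that header are the default suffixes pd.merge gives to non-key columns that exist in both frames. A minimal sketch of overriding them with the suffixes parameter follows; the two small frames and the file1/file2 labels are made-up stand-ins, not the real junction files.

```python
import pandas as pd

# Hypothetical stand-in frames in the layout of the simplified example.
df_a = pd.DataFrame({"chr": [1, 2], "start": [20, 40], "end": [30, 50],
                     "score": [50, 50], "gene": ["abc1", "cdf1"]})
df_b = pd.DataFrame({"chr": [1, 3], "start": [20, 60], "end": [30, 70],
                     "score": [20, 10], "gene": ["abc1", "xyz1"]})

keys = ["chr", "start", "end", "gene"]

# suffixes replaces the default ("_x", "_y") for overlapping non-key columns.
merged = pd.merge(df_a, df_b, on=keys, how="outer",
                  suffixes=("_file1", "_file2"))
print(merged.columns.tolist())
```

This only labels the overlapping columns; it does not drop the repeated annotation columns, so selecting just the columns you need before merging is probably still worthwhile.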


Is there a better way to do this? – sbradbio

Answer

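from functools import reduce

import pandas as pd

# df1 ... df10 are assumed to be loaded already; keep only the key columns and score.
cols = ['chr', 'gene', 'start', 'end', 'score']
dfs = [df1[cols], df2[cols], df3[cols], df10[cols]]
# Every frame still calls its column 'score', so the merged score columns
# get ambiguous _x/_y suffixes unless they are renamed per file first.
df_final = reduce(lambda left, right: pd.merge(left, right, on=
        ['gene', 'chr', 'start', 'end'], how='outer'), dfs)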
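One way to get a separate score column per file, as in the desired output from the question, is to rename score in each frame before the reduce. The sketch below is only an illustration under assumptions: merge_scores, the score_fileN naming, and the three tiny frames are hypothetical, and strand is added to the keys because the question asks for it.

```python
from functools import reduce

import pandas as pd

KEYS = ["chr", "start", "end", "strand", "gene"]


def merge_scores(frames):
    """Outer-merge the score column of every frame on the key columns."""
    renamed = [
        # Give each file's score a unique name so no _x/_y suffixes are needed.
        df[KEYS + ["score"]].rename(columns={"score": f"score_file{i}"})
        for i, df in enumerate(frames, start=1)
    ]
    return reduce(
        lambda left, right: pd.merge(left, right, on=KEYS, how="outer"),
        renamed,
    )


def make_frame(scores):
    """Build a hypothetical two-row frame shaped like the simplified example."""
    return pd.DataFrame({"chr": [1, 2], "start": [20, 40], "end": [30, 50],
                         "score": scores, "strand": ["-", "+"],
                         "gene": ["abc1", "cdf1"]})


result = merge_scores([make_frame([50, 50]), make_frame([20, 30]),
                       make_frame([40, 50])])
print(result)
```

With real data the list would hold the ten loaded frames; rows missing from a file end up as NaN in that file's score column because of how='outer'.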

I tried the code above, but it does not append each df's columns; I just get 4 columns: chrom \t gene \t start \t end \t score – sbradbio
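On the memory question from the original post, one option that may help is to read only the key columns and score from each file (usecols), index each score by the keys, and align the resulting Series with pd.concat. This is a sketch under assumptions: load_scores and the score_fileN labels are hypothetical, io.StringIO stands in for the files on disk, and each key combination is assumed to appear at most once per file.

```python
import io

import pandas as pd

KEYS = ["chr", "start", "end", "strand", "gene"]


def load_scores(source, label):
    """Read only the key columns and score; return score indexed by the keys."""
    df = pd.read_csv(source, sep="\t", usecols=KEYS + ["score"])
    return df.set_index(KEYS)["score"].rename(label)


# Hypothetical file contents; real code would pass the paths found by glob.
header = "chr\tstart\tend\tname\tscore\tstrand\tgene\n"
file1 = io.StringIO(header + "1\t20\t30\tJ1\t50\t-\tabc1\n2\t40\t50\tJ2\t50\t+\tcdf1\n")
file2 = io.StringIO(header + "1\t20\t30\tJ1\t20\t-\tabc1\n3\t60\t70\tJ3\t10\t+\txyz1\n")

columns = [load_scores(f, f"score_file{i}")
           for i, f in enumerate([file1, file2], start=1)]

# Align on the shared key index; keys absent from a file become NaN,
# which mirrors the behaviour of how='outer'.
wide = pd.concat(columns, axis=1).reset_index()
print(wide)
```

The annotation columns (splice_site, transcripts and so on) could then be merged in once from a single file rather than carried through every merge.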