2017-02-16

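Select rows based on rows in a second DataFrame

I have two DataFrames and am looking for a way to select (and count) rows of df1 based on the rows in df2.

This is my df1: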

 Chromosome Start position End position Reference Variant reads \ 
0  chr1  109419841  109419841   C  T  1 
1  chr1  197008365  197008365   C  T  1 

    variation reads % variation     gDNA nomencl \ 
0    1   100 Chr1(GRCh37):g.109419841C>T 
1    1   100 Chr1(GRCh37):g.197008365C>T 

      cDNA nomencl ... exon transcript ID   inheritance \ 
0 NM_013296.4:c.-258C>T ...  2 NM_013296.4 Autosomal recessive 
1 NM_001994.2:c.*143G>A ...  UTR NM_001994.2 Autosomal recessive 

    test type      Phenotype male coverage male ratio covered \ 
0 Unknown Deafness, autosomal recessief    0     0 
1 Unknown   Factor 13 deficientie    0     0 

    female coverage female ratio covered ratio M:F 
0    1     1  0.0 
1    1     1  0.0 

df1 has these columns:

Chromosome    10561 non-null object 
Start position   10561 non-null int64 
End position    10561 non-null int64 
Reference     10415 non-null object 
Variant     10536 non-null object 
reads      10561 non-null int64 
variation reads   10561 non-null int64 
% variation    10561 non-null int64 
gDNA nomencl    10561 non-null object 
cDNA nomencl    10446 non-null object 
protein nomencl   9997 non-null object 
classification   10561 non-null object 
status     10561 non-null object 
gene      10560 non-null object 
Sanger sequencing list 10561 non-null object 
exon      10502 non-null object 
transcript ID    10460 non-null object 
inheritance    8259 non-null object 
test type     10561 non-null object 
Phenotype     10380 non-null object 
male coverage    10561 non-null int64 
male ratio covered  10561 non-null int64 
female coverage   10561 non-null int64 
female ratio covered  10561 non-null int64 

and this is df2:

Chromosome Startposition Endposition Bases Meancoverage \ 
0  chr1  11073785  11074022 27831.0 117.927966 
1  chr1  11076901  11077064 11803.0  72.411043 

    Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \ 
0   97.0    1.0    1.0 TARDBP   1 
1   76.0    1.0    1.0 TARDBP   2 

    PositionGenes   PositionGenome      Position 
0  TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022 
1  TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064 

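I want to select all rows of df1 for which df2 contains a row with

  • the same value in 'Chromosome'
  • df1['Start position'] >= df2.Startposition
  • df1['End position'] <= df2.Endposition.

If all three conditions are met by the same row of df2, I want to select the corresponding row of df1.

I already merged the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' in order to build a lambda function, but I could not come up with anything that works.

So I hope you can help me...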
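For illustration, the three conditions can be written directly as a merge on 'Chromosome' followed by a filter. This is only a minimal sketch on made-up toy rows (the positions and the `selected` variable are assumptions, not the real data). It builds every df1/df2 pair that shares a chromosome, so it is only practical for small inputs.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow df1 and df2 above.
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Start position": [11073800, 109419841, 100],
    "End position": [11073800, 109419841, 100],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Startposition": [11073785, 11076901],
    "Endposition": [11074022, 11077064],
})

# Pair every df1 row with every df2 row on the same chromosome,
# keeping the original df1 index in a column named "index".
pairs = df1.reset_index().merge(df2, on="Chromosome")

# Keep the pairs where the df1 interval lies inside the df2 interval.
inside = pairs[
    (pairs["Start position"] >= pairs["Startposition"])
    & (pairs["End position"] <= pairs["Endposition"])
]

# A df1 row may match several df2 rows, so de-duplicate before selecting.
selected = df1.loc[inside["index"].unique()]
print(len(selected))
```

Here `len(selected)` gives the count of matching df1 rows, and `selected` holds the rows themselves.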

Please check this [answer](http://stackoverflow.com/a/34953669/2901002) – jezrael

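@jezrael. If I try the answer you suggested, I get a memory error for pd.merge(df1, df2, on=['Chromosome']). df1 has > 10,000 rows and df2 has > 6 million rows. I already reduced the DataFrames to the few columns needed for the task, but I still get the same error. – SGeuer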

Indeed, that is a problem with large DataFrames... unfortunately. – jezrael

Answers

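Short update: in the end I solved this problem with the UNIX tool bedtools, using the -wb option. If anyone can suggest a Python-based solution, I would still be glad to see it.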

Sorry, my previous post was incomplete. This is the solution: bedtools intersect -a file1.bed -b file2.bed -wb – SGeuer
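For readers still looking for a Python-based approach, one possible sketch avoids building the full merge. It works chromosome by chromosome: sort the df2 regions by start, take a running maximum of their ends, and use a binary search to check whether any region starting at or before a df1 row also ends at or after it. The function name `select_contained` and the toy rows are assumptions for illustration. This has not been run on the 6-million-row data from the question.

```python
import numpy as np
import pandas as pd

def select_contained(df1, df2):
    """Return the df1 rows that lie inside at least one df2 region on the
    same chromosome (hypothetical helper, not part of pandas)."""
    keep = np.zeros(len(df1), dtype=bool)
    starts1 = df1["Start position"].to_numpy()
    ends1 = df1["End position"].to_numpy()
    regions = {chrom: grp for chrom, grp in df2.groupby("Chromosome")}

    # .indices maps each chromosome to the integer positions of its df1 rows.
    for chrom, pos in df1.groupby("Chromosome").indices.items():
        reg = regions.get(chrom)
        if reg is None:
            continue  # no df2 regions on this chromosome
        reg = reg.sort_values("Startposition")
        starts2 = reg["Startposition"].to_numpy()
        # max_end[i] is the largest end among the first i + 1 sorted regions.
        max_end = np.maximum.accumulate(reg["Endposition"].to_numpy())

        s, e = starts1[pos], ends1[pos]
        # n = number of regions whose start is <= the df1 start.
        n = np.searchsorted(starts2, s, side="right")
        hit = n > 0
        # Some region contains the row iff the largest end among them is >= e.
        hit[hit] = max_end[n[hit] - 1] >= e[hit]
        keep[pos] = hit
    return df1[keep]

# Toy data (hypothetical values); column names follow the question.
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Start position": [11073800, 109419841, 100],
    "End position": [11073800, 109419841, 100],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Startposition": [11073785, 11076901],
    "Endposition": [11074022, 11077064],
})

result = select_contained(df1, df2)
print(len(result))
```

The sketch only answers whether a containing region exists for each df1 row. Unlike `-wb`, it does not report which df2 row matched.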