Select rows based on rows in a second DataFrame

I have two DataFrames and am looking for a way to select (and count) rows of df1 based on the rows in df2.

This is my df1:

 Chromosome Start position End position Reference Variant reads \ 
0  chr1  109419841  109419841   C  T  1 
1  chr1  197008365  197008365   C  T  1 

    variation reads % variation     gDNA nomencl \ 
0    1   100 Chr1(GRCh37):g.109419841C>T 
1    1   100 Chr1(GRCh37):g.197008365C>T 

      cDNA nomencl ... exon transcript ID   inheritance \ 
0 NM_013296.4:c.-258C>T ...  2 NM_013296.4 Autosomal recessive 
1 NM_001994.2:c.*143G>A ...  UTR NM_001994.2 Autosomal recessive 

    test type      Phenotype male coverage male ratio covered \ 
0 Unknown Deafness, autosomal recessief    0     0 
1 Unknown   Factor 13 deficientie    0     0 

    female coverage female ratio covered ratio M:F 
0    1     1  0.0 
1    1     1  0.0

df1 has these columns:

Chromosome    10561 non-null object 
Start position   10561 non-null int64 
End position    10561 non-null int64 
Reference     10415 non-null object 
Variant     10536 non-null object 
reads      10561 non-null int64 
variation reads   10561 non-null int64 
% variation    10561 non-null int64 
gDNA nomencl    10561 non-null object 
cDNA nomencl    10446 non-null object 
protein nomencl   9997 non-null object 
classification   10561 non-null object 
status     10561 non-null object 
gene      10560 non-null object 
Sanger sequencing list 10561 non-null object 
exon      10502 non-null object 
transcript ID    10460 non-null object 
inheritance    8259 non-null object 
test type     10561 non-null object 
Phenotype     10380 non-null object 
male coverage    10561 non-null int64 
male ratio covered  10561 non-null int64 
female coverage   10561 non-null int64 
female ratio covered  10561 non-null int64

And this is df2:

Chromosome Startposition Endposition Bases Meancoverage \ 
0  chr1  11073785  11074022 27831.0 117.927966 
1  chr1  11076901  11077064 11803.0  72.411043 

    Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \ 
0   97.0    1.0    1.0 TARDBP   1 
1   76.0    1.0    1.0 TARDBP   2 

    PositionGenes   PositionGenome      Position 
0  TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022 
1  TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064

I want to select all rows of df1 that fall within a row of df2, meaning:

the same value for 'Chromosome'
df1['Start position'] >= df2.Startposition
df1['End position'] <= df2.Endposition

If all three conditions are met by the same row of df2, I want to select the corresponding row of df1.
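To make the three conditions concrete, here is a minimal sketch on made-up toy data (the values and the variable names `pairs`, `inside`, `selected`, and `n_selected` are illustrative assumptions, not part of the real data). It pairs rows on 'Chromosome' with `merge` and then filters on the two position conditions. This is only practical when both frames are small, because the intermediate merge holds every same-chromosome pair.

```python
import pandas as pd

# Toy stand-ins for df1 (variants) and df2 (covered regions); values are made up.
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Start position": [11073800, 109419841, 500],
    "End position": [11073800, 109419841, 500],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Startposition": [11073785, 11076901, 100],
    "Endposition": [11074022, 11077064, 400],
})

# Pair every df1 row with every df2 row on the same chromosome.
# reset_index keeps the original df1 row label in a column named "index".
pairs = df1.reset_index().merge(df2, on="Chromosome")

# Keep only pairs where the df1 interval lies inside the df2 interval.
inside = pairs[
    (pairs["Start position"] >= pairs["Startposition"])
    & (pairs["End position"] <= pairs["Endposition"])
]

# Select each matching df1 row once, then count the matches.
selected = df1.loc[inside["index"].unique()]
n_selected = len(selected)
```

In this toy data only the first df1 row lies inside a df2 region, so `selected` holds that single row.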

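I have already combined the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome', hoping to build a lambda function on top of it, but I could not come up with anything that works.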

So I hope you can help me...

Source

2017-02-16 SGeuer

Please check this [answer](http://stackoverflow.com/a/34953669/2901002) – jezrael

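@jezrael. If I try the answer you suggested, I get a memory error from pd.merge(df1, df2, on=['Chromosome']). df1 has > 10,000 rows, while df2 has > 6 million rows. I have already reduced both DataFrames to the few columns needed for the task, but I still get the same error. – SGeuer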

Indeed, that is a problem with large DataFrames... unfortunately. – jezrael
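One possible way around the memory problem, sketched here as an assumption rather than a tested solution for the real data, is to avoid the pairwise merge entirely. Per chromosome, the df2 regions are sorted by start; a running maximum of their ends then tells, for each df1 row, whether any region that starts at or before the df1 start also ends at or after the df1 end. The helper name `rows_inside_regions` is hypothetical, the toy values are made up, and the sketch assumes df1 has a unique index. It answers "is this df1 row inside some df2 region" but does not report which df2 row matched.

```python
import numpy as np
import pandas as pd


def rows_inside_regions(df1, df2):
    """Boolean mask over df1: True where some df2 region on the same
    chromosome fully contains the df1 interval (assumes a unique df1 index)."""
    mask = pd.Series(False, index=df1.index)
    regions_by_chrom = {chrom: grp for chrom, grp in df2.groupby("Chromosome")}
    for chrom, variants in df1.groupby("Chromosome"):
        regions = regions_by_chrom.get(chrom)
        if regions is None:
            continue  # no regions at all on this chromosome
        regions = regions.sort_values("Startposition")
        starts = regions["Startposition"].to_numpy()
        # best_end[i] = largest end among the i + 1 regions with the smallest starts
        best_end = np.maximum.accumulate(regions["Endposition"].to_numpy())
        v_start = variants["Start position"].to_numpy()
        v_end = variants["End position"].to_numpy()
        # k = number of regions that start at or before each df1 start
        k = np.searchsorted(starts, v_start, side="right")
        has_candidate = k > 0
        hit = np.zeros(len(variants), dtype=bool)
        hit[has_candidate] = best_end[k[has_candidate] - 1] >= v_end[has_candidate]
        mask.loc[variants.index] = hit
    return mask


# Made-up toy data for illustration only.
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chrX"],
    "Start position": [11073800, 109419841, 100, 5],
    "End position": [11073800, 109419841, 100, 5],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Startposition": [11073785, 11076901, 100],
    "Endposition": [11074022, 11077064, 400],
})

mask = rows_inside_regions(df1, df2)
selected = df1[mask]          # the df1 rows that fall inside a df2 region
n_selected = int(mask.sum())  # how many there are
```

Memory use stays proportional to the size of the two inputs, since only a few arrays per chromosome are built instead of one row per candidate pair.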

Short update: in the end I solved this problem with UNIX bedtools -wb. If someone can come up with a Python-based solution, I would still be glad to see it.

Source

2017-02-22 13:32:46 SGeuer

Sorry, my previous post was incomplete. This is the solution: bedtools intersect -a file1.bed -b file2.bed -wb – SGeuer
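For anyone following the bedtools route from pandas, the two input files could be produced roughly as sketched below. The thread does not say how file1.bed and file2.bed were created, so everything here is an assumption: the helper name `to_bed` is hypothetical, and the start is shifted by one on the assumption that the DataFrame positions are 1-based and inclusive, whereas BED intervals are 0-based and half-open. Check the coordinate convention of your own data before relying on that shift. Note also that `bedtools intersect` reports any overlap by default, which coincides with full containment only for single-base entries like the df1 rows shown in the question.

```python
import io

import pandas as pd


def to_bed(df, chrom_col, start_col, end_col, path_or_buf):
    """Write chrom/start/end as a tab-separated, header-less BED file.

    Assumes the input positions are 1-based and inclusive, so the start is
    shifted by one to match BED's 0-based, half-open convention.
    """
    bed = pd.DataFrame({
        "chrom": df[chrom_col],
        "start": df[start_col] - 1,
        "end": df[end_col],
    })
    bed.to_csv(path_or_buf, sep="\t", header=False, index=False)
    return bed


# Made-up one-row example, written to an in-memory buffer instead of a file.
toy = pd.DataFrame({
    "Chromosome": ["chr1"],
    "Start position": [109419841],
    "End position": [109419841],
})
buf = io.StringIO()
to_bed(toy, "Chromosome", "Start position", "End position", buf)
bed_text = buf.getvalue()
```

Passing a file name such as "file1.bed" instead of the buffer would write the file to disk for use with the command above.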
