I have a tab-delimited file like this, which I want to group and sort in Python:
gene_name length
Traes_3AS_4F141FD24.2 24.8
Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
UCW_Tt-k25_contig_29046 0.4
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4
UCW_Tt-k25_contig_29046 12.3
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1
UCW_Tt-k25_contig_29046 23.7
UCW_Tt-k25_contig_29046 23.7
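I need to group the rows by gene_name and split them into 3 files: 1) if the gene_name is unique, 2) if the difference in length between the genes within a group is > 10, 3) if the difference in length within a group is < 10. This is my attempt: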
from itertools import groupby

def iter_hits(hits):
    # Yield each hit together with the one before it (previous, current).
    for i in range(1, len(hits)):
        (p, c) = hits[i-1], hits[i]
        yield p, c

def is_overlap(hits):
    # True if any two neighbouring hits differ in length by more than 10.
    for p, c in iter_hits(hits):
        if c[1] - p[1] > 10:
            return True
    return False

fh = open('my_file', 'r')
oh1 = open('a', 'w')
oh2 = open('b', 'w')
oh3 = open('c', 'w')

# groupby only merges consecutive lines, so the input must already be
# ordered by gene_name.
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[1] = float(hsp[1])
        hits.append(hsp)
    hits.sort(key=lambda x: x[1])
    if len(hits) == 1:
        oh = oh3
    elif is_overlap(hits):
        oh = oh1
    else:
        oh = oh2
    for hit in hits:
        oh.write('\t'.join([str(f) for f in hit]) + '\n')
The output I need is:
c) Traes_3AS_4F141FD24.2 24.8
b) Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
a) Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
UCW_Tt-k25_contig_29046 0.4
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4
UCW_Tt-k25_contig_29046 12.3
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1
UCW_Tt-k25_contig_29046 23.7
UCW_Tt-k25_contig_29046 23.7
P.S. Sorry for such a long question, but otherwise it would be hard to explain clearly.
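What exactly is going wrong? Are you getting any errors? –
The gene UCW_Tt-k25_contig_29046 ends up in file b. I think this is because I am subtracting the length of the previous gene; how can I improve this? – user3224522
If there are two values greater than 10, do you need them to end up in the 'c' file? –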
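One possible reading of the expected output is that the spread of a whole group (largest length minus smallest) decides the file, rather than the gap between neighbouring rows. For UCW_Tt-k25_contig_29046 every neighbouring gap is below 10, but the spread is 23.7 - 0.4 = 23.3. The following is a minimal sketch under that assumption. The helper names `classify` and `split_by_gene` are hypothetical, the sample is a subset of the rows from the question without the header line, and a spread of exactly 10 is sent to 'b' because the question does not say where it belongs.

```python
from itertools import groupby


def classify(lengths, threshold=10.0):
    """Return 'c' for a single row, 'a' if max - min > threshold, else 'b'."""
    if len(lengths) == 1:
        return "c"
    spread = max(lengths) - min(lengths)
    # Assumption: a spread of exactly `threshold` counts as 'b'.
    return "a" if spread > threshold else "b"


def split_by_gene(lines, threshold=10.0):
    """Group consecutive lines by gene name and bucket the rows under a/b/c."""
    buckets = {"a": [], "b": [], "c": []}
    # Skip blank lines; assumes there is no header line in the input.
    rows = (line.split() for line in lines if line.strip())
    for gene, grp in groupby(rows, key=lambda r: r[0]):
        lengths = sorted(float(r[1]) for r in grp)
        label = classify(lengths, threshold)
        buckets[label].extend((gene, value) for value in lengths)
    return buckets


sample = [
    "Traes_3AS_4F141FD24.2\t24.8",
    "Traes_4AL_A00EF17B2.1\t0.0",
    "Traes_4AL_A00EF17B2.1\t0.9",
    "Traes_4BS_6943FED4B.1\t4.5",
    "Traes_4BS_6943FED4B.1\t42.9",
    "UCW_Tt-k25_contig_29046\t0.4",
    "UCW_Tt-k25_contig_29046\t2.8",
    "UCW_Tt-k25_contig_29046\t11.4",
    "UCW_Tt-k25_contig_29046\t12.3",
    "UCW_Tt-k25_contig_29046\t19.6",
    "UCW_Tt-k25_contig_29046\t23.7",
]

buckets = split_by_gene(sample)
for label in "abc":
    for gene, value in buckets[label]:
        print(label, gene, value, sep="\t")
```

To write the three files instead of printing, each bucket can be dumped to the file named after its label. As with the original attempt, `groupby` only merges consecutive lines, so the input has to be ordered by gene_name first.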