Grouping and sorting a file in Python


gene_name    length 
Traes_3AS_4F141FD24.2 24.8  
Traes_4AL_A00EF17B2.1 0.0 
Traes_4AL_A00EF17B2.1 0.9 
Traes_4BS_6943FED4B.1 4.5 
Traes_4BS_6943FED4B.1 42.9  
UCW_Tt-k25_contig_29046 0.4 
UCW_Tt-k25_contig_29046 2.8 
UCW_Tt-k25_contig_29046 11.4  
UCW_Tt-k25_contig_29046 12.3  
UCW_Tt-k25_contig_29046 14.4 
UCW_Tt-k25_contig_29046 14.2  
UCW_Tt-k25_contig_29046 19.6  
UCW_Tt-k25_contig_29046 19.6 
UCW_Tt-k25_contig_29046 21.1  
UCW_Tt-k25_contig_29046 23.7  
UCW_Tt-k25_contig_29046 23.7

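I need to group the rows by gene_name and split them into 3 files: 1) if the gene_name is unique; 2) if the difference between the lengths within a group is > 10; 3) if the difference between the lengths within a group is < 10. This is my attempt: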

from itertools import groupby

def iter_hits(hits):
    for i in range(1, len(hits)):
        (p, c) = hits[i-1], hits[i]
        yield p, c

def is_overlap(hits):
    for p, c in iter_hits(hits):
        if c[1] - p[1] > 10:
            return True

fh = open('my_file', 'r')
oh1 = open('a', 'w')
oh2 = open('b', 'w')
oh3 = open('c', 'w')

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[1] = float(hsp[1])
        hits.append(hsp)
    hits.sort(key=lambda x: x[1])
    if len(hits) == 1:
        oh = oh3
    elif is_overlap(hits):
        oh = oh1
    else:
        oh = oh2

    for hit in hits:
        oh.write('\t'.join([str(f) for f in hit]) + '\n')

The output I need is:

c)Traes_3AS_4F141FD24.2 24.8
b)Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
a)Traes_4BS_6943FED4B.1 4.5 
Traes_4BS_6943FED4B.1 42.9  
UCW_Tt-k25_contig_29046 0.4 
UCW_Tt-k25_contig_29046 2.8 
UCW_Tt-k25_contig_29046 11.4  
UCW_Tt-k25_contig_29046 12.3  
UCW_Tt-k25_contig_29046 14.4 
UCW_Tt-k25_contig_29046 14.2  
UCW_Tt-k25_contig_29046 19.6  
UCW_Tt-k25_contig_29046 19.6 
UCW_Tt-k25_contig_29046 21.1  
UCW_Tt-k25_contig_29046 23.7  
UCW_Tt-k25_contig_29046 23.7

P.S. I'm sorry for such a long question, but otherwise it would be hard to explain clearly.

Source

2015-07-20 user3224522

What exactly is the problem? Are you getting any errors? –

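The gene UCW_Tt-k25_contig_29046 ends up in file b. I think this is because I am subtracting each length from the previous one within the gene. How can I improve this? – user3224522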

If a group has two values that differ by more than 10, do you need them to end up in the 'c' file? –

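If your goal is -

I need all genes whose lengths differ by more than 10 to go in one file, i.e. 23.7 - 0.4 > 10, so they should be in one file.

Then in is_overlap(hits) you can check the difference between the last element and the first element. Since you have already sorted the hits by their second element before calling this function, the last element will be the largest and the first element will be the smallest.

So you can do -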

def is_overlap(hits):
    # hits is sorted by length, so last minus first is the group's full range
    return hits[-1][1] - hits[0][1] > 10
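
To see why comparing only neighbouring values sends UCW_Tt-k25_contig_29046 to the wrong file, here is a small sketch using that group's lengths from the question. The variable names are made up for illustration; only the numbers come from the question.

```python
# Lengths of the UCW_Tt-k25_contig_29046 group from the question, sorted.
lengths = sorted([0.4, 2.8, 11.4, 12.3, 14.4, 14.2, 19.6, 19.6, 21.1, 23.7, 23.7])

# Approach in the question: largest gap between neighbouring values.
max_adjacent_gap = max(b - a for a, b in zip(lengths, lengths[1:]))

# Approach suggested here: gap between the smallest and the largest value.
full_range = lengths[-1] - lengths[0]

print(round(max_adjacent_gap, 1))  # 8.6
print(round(full_range, 1))        # 23.3
```

No single step between neighbours exceeds 10, so the pairwise check never fires, while the overall range clearly does.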

Source

2015-07-20 11:56:56

Your data seems to already be in sorted order, so you can just compare the first and last floats from each group:

from itertools import groupby

with open('a', 'w') as uniq, open('b', 'w') as lt, open('c', 'w') as gt:
    with open("foo.txt") as f:
        next(f)  # skip the header line
        for _, v in groupby(f, lambda x: x.split(None, 1)[0]):
            v = list(v)
            if len(v) == 1:
                uniq.write(v[0])
                continue
            # difference between the last and the first length in the group
            diff = float(v[-1].split(None, 1)[1]) - float(v[0].split(None, 1)[1])
            if diff < 10:
                lt.writelines(v)
            else:
                # spread of 10 or more
                gt.writelines(v)
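
If the lengths inside a group are not guaranteed to be sorted (in the sample, 14.4 comes before 14.2), a variant that takes the minimum and maximum of each group avoids relying on row order. This is only a sketch: the function name `classify`, the `threshold` parameter, and the dictionary keys are hypothetical, and it assumes that a difference of exactly 10 counts as "narrow", which the question does not specify.

```python
from itertools import groupby

def classify(lines, threshold=10.0):
    """Split 'gene_name length' rows into unique / narrow / wide groups."""
    rows = [line.split() for line in lines if line.strip()]
    out = {"unique": [], "narrow": [], "wide": []}
    # groupby only merges adjacent rows, so rows with the same name must be contiguous.
    for name, grp in groupby(rows, key=lambda r: r[0]):
        lengths = [float(r[1]) for r in grp]
        if len(lengths) == 1:
            key = "unique"
        elif max(lengths) - min(lengths) > threshold:
            key = "wide"
        else:
            key = "narrow"  # assumption: exactly `threshold` lands here
        out[key].extend((name, x) for x in lengths)
    return out

# A few rows from the question, without the header line.
sample = """\
Traes_3AS_4F141FD24.2 24.8
Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
""".splitlines()

result = classify(sample)
print(result["unique"])  # [('Traes_3AS_4F141FD24.2', 24.8)]
```

Each of the three lists can then be written to its own file, in whichever of a, b and c you want it to go.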

Source

2015-07-20 13:03:20
