2017-02-15

Nested dictionary and summing values, for a file that looks like this:

LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1 
LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2 
LOC_Os06g39870.1 chlo 7, cyto 4, nucl 1, E.R. 1, pero 1 
LOC_Os06g48240.1 chlo 9, mito 4 
LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2 

I care about "chlo", "chlo_mito" and "mito", and about the sum of their values in each line.

For example, in the line LOC_Os06g07630.1 I would use chlo 2 and chlo_mito 1, so the summed value is 3 = (chlo) 2 + (chlo_mito) 1.

The total for the whole line is

(cyto) 8 + (chlo) 2 + (extr) 2 + (nucl) 1 + (cysk) 1 + (chlo_mito) 1 + (cysk_nucl) 1 = 16, so I then print 3/16.
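The arithmetic for that line can be checked with a short sketch. The dictionary is typed in by hand from the LOC_Os06g07630.1 line, and the names `fields` and `wanted` are only illustrative:

```python
# Field values copied by hand from the LOC_Os06g07630.1 line.
fields = {"cyto": 8, "chlo": 2, "extr": 2, "nucl": 1,
          "cysk": 1, "chlo_mito": 1, "cysk_nucl": 1}
wanted = ("chlo", "mito", "chlo_mito")

# Sum of the fields of interest, then the sum of every field on the line.
subtotal = sum(value for name, value in fields.items() if name in wanted)
total = sum(fields.values())

ratio = "{}/{}".format(subtotal, total)
print(ratio)  # 3/16
```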

I want the following output:

LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16 
LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5 
LOC_Os06g39870.1 chlo 7 7/14 
LOC_Os06g48240.1 chlo 9 mito 4 13/13 
LOC_Os06g48250.1 chlo 4 mito 2 6/13 

My code is:

import os
import re

dic = {}
# Match the field names of interest as whole words only.
b = re.compile(r"\b(?:chlo_mito|chlo|mito)\b")
with open(os.path.expanduser("~/A"), "r") as f1:
    for i in f1:
        if i.startswith("#") or not i.strip():
            continue
        # "ID name value, name value, ..." -> ["ID", "name", "value", ...]
        parts = i.replace(",", "").split()
        if b.search(i) is not None:
            temp = parts[1:]
            # Pair each field name with the number that follows it.
            dic[parts[0]] = {temp[x - 1]: temp[x] for x in range(1, len(temp), 2)}

lis = ["chlo", "mito", "chlo_mito"]
for k in dic:
    # Total over every field on the line.
    sum_value = sum(float(v) for v in dic[k].values())
    # Total over the fields of interest only.
    sum_values = 0
    out = []
    for i in lis:
        if i in dic[k]:
            sum_values += float(dic[k][i])
            out.append("{} {}".format(i, dic[k][i]))
    print("{} {} {:g}/{:g}".format(k, " ".join(out), sum_values, sum_value))

Answers


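You are not very clear about what problem you are actually having. But here is what I would do: write a function that takes one line of the file as input and returns a dictionary with the keys "chlo", "chlo_mito", "mito" and "total sum". That should make your life easier.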
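A minimal sketch of such a function, assuming every line has the form `ID name value, name value, ...` as in the sample file. The name `summarize_line`, the `wanted` parameter and the extra `id` key are hypothetical and not taken from the question:

```python
def summarize_line(line, wanted=("chlo", "mito", "chlo_mito")):
    """Parse one 'ID name value, name value, ...' line into a dict."""
    test_id, _, rest = line.strip().partition(" ")
    result = {"id": test_id, "total sum": 0.0}
    for pair in rest.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the last run of whitespace: field name, then its number.
        name, value = pair.rsplit(None, 1)
        value = float(value)
        result["total sum"] += value   # every field counts towards the total
        if name in wanted:
            result[name] = value       # only the requested fields are kept
    return result

row = summarize_line("LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2")
print(row["chlo"] + row["mito"], row["total sum"])  # 9.5 14.5
```

Fields such as "nucl" only contribute to "total sum", so it does not matter how many of them a given line has.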


But every line also has other fields such as "nucl", and the number of them differs from line to line – zychen


Something like this code may help you:

I assume your input file is called f_input.txt.

from ast import literal_eval

with open("f_input.txt", "r") as f:
    for line in f:
        # k = [id, name, value, name, value, ...]
        k = line.replace(",", "").split()
        if not k:
            continue  # skip blank lines
        # Pair each field name with its number (int or float).
        pairs = [(name, literal_eval(v)) for name, v in zip(k[1::2], k[2::2])]
        total = sum(v for _, v in pairs)
        parts, subtotal = [], 0
        for field in ("chlo", "mito", "chlo_mito"):
            value = sum(v for name, v in pairs if name == field)
            # chlo is always shown; mito and chlo_mito only when non-zero.
            if field == "chlo" or value != 0:
                parts.append("{0} {1}".format(field, value))
                subtotal += value
        print("{0} {1} {2}/{3}".format(k[0], " ".join(parts), subtotal, total))

Output:

LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16 
LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5 
LOC_Os06g39870.1 chlo 7 7/14 
LOC_Os06g48240.1 chlo 9 mito 4 13/13 
LOC_Os06g48250.1 chlo 4 mito 2 6/13 

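I don't know how much speed matters to you, but in genomics it usually does. If you can avoid it, you should not use too many string operations, and you should use regular expressions as little as possible.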
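Whether this matters for your data is something you can measure rather than guess. Below is a rough sketch using the standard `timeit` module; the helper names `parse_with_regex` and `parse_with_split` and the pattern `PAIR_RE` are made up for the comparison, and the timings will depend on your machine and Python version:

```python
import re
import timeit

LINE = "LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2"
# Hypothetical pattern: a field name followed by one space and a number.
PAIR_RE = re.compile(r"([\w.]+) ([\d.]+)")

def parse_with_regex(line):
    # Drop the ID column, then pull out every "name value" pair.
    rest = line.split(None, 1)[1]
    return [(name, float(value)) for name, value in PAIR_RE.findall(rest)]

def parse_with_split(line):
    # Same result using only str.split / str.rsplit.
    rest = line.split(None, 1)[1]
    pairs = (pair.rsplit(None, 1) for pair in rest.split(", "))
    return [(name, float(value)) for name, value in pairs]

# Both parsers must agree before their speed is compared.
assert parse_with_regex(LINE) == parse_with_split(LINE)

for func in (parse_with_regex, parse_with_split):
    seconds = timeit.timeit(lambda: func(LINE), number=20000)
    print(func.__name__, round(seconds, 3))
```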

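Here is a version that uses no regexes and tries not to spend time constructing temporary objects. I chose an output format different from yours, because your output format is hard to parse again. You can easily change it by modifying the .format string.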

Test_data = """ 
LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1 
LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2 
LOC_Os06g39870.1 chlo 7, cyto 4, nucl 1, E.R. 1, pero 1 
LOC_Os06g48240.1 chlo 9, mito 4 
LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2 
""" 

def open_input():
    """
    Return a file-like object as input stream. In this case,
    it is a StringIO based on your test data. If you have a file
    name, use that instead.
    """

    if False:
        return open('inputfile.txt', 'r')
    else:
        import io
        return io.StringIO(Test_data)

SUM_FIELDS = set("chlo mito chlo_mito".split())

with open_input() as infile:

    for line in infile:

        line = line.strip()
        if not line: continue

        cols = line.split(maxsplit=1)
        if len(cols) != 2: continue

        test_id, remainder = cols
        out_fields = []

        fld_sum = tot_sum = 0.0

        for pair in remainder.split(', '):
            k, v = pair.rsplit(maxsplit=1)
            vf = float(v)
            tot_sum += vf

            if k in SUM_FIELDS:
                fld_sum += vf
                out_fields.append(pair)

        print("{0} {2}/{3} ({4:.0%}) {1}".format(test_id, ', '.join(out_fields), fld_sum, tot_sum, fld_sum/tot_sum))
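If you would rather have the layout from the question, one possible change to the format string is sketched here. The helper `format_like_question` is hypothetical; its arguments correspond to `test_id`, `out_fields`, `fld_sum` and `tot_sum` from the loop above, and the `g` format type is used so that whole numbers print without a trailing `.0`:

```python
def format_like_question(test_id, out_fields, fld_sum, tot_sum):
    # out_fields holds strings such as "chlo 2"; {:g} turns 3.0 into "3".
    return "{0} {1} {2:g}/{3:g}".format(
        test_id, " ".join(out_fields), fld_sum, tot_sum)

print(format_like_question("LOC_Os06g12160.1", ["chlo 7", "mito 2.5"], 9.5, 14.5))
# LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5
```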