2017-10-11 39 views
-4

我想如下修改如下程序:修改代碼,以便它可以從文件中讀取,併產生相應的輸出

  1. 第一行包含蛋白質的名稱和計數隨後的這種蛋白質的輸出線(如N)

  2. 接下來的N行中的每一行都包含一個匹配信息:GBoxes的位置和實際匹配(記住存在擾動和X的即通配符,允許)。

腳本:

import csv 
import matplotlib.pyplot as plt 
import numpy as np 

# all G boxes 
def match(x,y): 
     mismatch = 0 
     for i in range(len(x)): 
       if (x[i] == 'X' or x[i] == y[i]): 
         pass 
       else: 
         mismatch += 1 
     if(mismatch <= 1): 
       return True 
     else: 
       return False 



def H(protein,x1,x2,x3,x4): 
     pL1=[] 
     pL2=[] 
     pL3=[] 
     pL4=[] 
     L1=[] 
     L2=[] 
     L3=[] 
     L4=[] 


     for i in range(len(protein)-len(x1)): 
       if(match(x1, protein[i:i+len(x1)]) == True): 
#      global L1 
         pL1=pL1 + [i] 
         L1 = L1+[protein[i:i+len(x1)]] 


     for j in range(len(protein)-len(x2)): 
       if (match(x2, protein[j:j+len(x2)]) == True): 
#      global L2 
         pL2=pL2+[j] 
         L2 = L2+[protein[j:j+len(x2)]] 

     for k in range(len(protein)-len(x3)): 
       if (match(x3, protein[k:k+len(x3)]) == True): 
#       global L3 
         pL3=pL3+[k] 
         L3 = L3+[protein[k:k+len(x3)]] 

     for l in range(len(protein)-len(x4)): 
       if (match(x4, protein[l:l+len(x4)]) == True): 
#      global L3 
         pL4=pL4+[l] 
         L4 = L4+[protein[l:l+len(x4)]] 
     candidates = [] 

     for i in range(len(pL1)): 
       for j in range(len(pL2)): 
         for k in range(len(pL3)): 
           for l in range(len(pL4)): 
             if 40 <=pL2[j]-pL1[i] <= 80 and 40 <=pL3[k]- pL2[j] <= 80 and 20 <=pL4[l]- pL3[k] <= 40: 
               a = L1[i],pL1[i] 
               b = L2[j],pL2[j] 
               c = L3[k],pL3[k] 
               d = L4[l],pL4[l] 
               print a,b,c,d 
               candidates.append((a,b,c,d)) 


     offset = 5 
     for i in np.arange((np.array(candidates).transpose()).shape[1]): 
       barchartdata = np.unique(np.array(candidates).transpose()[:,i]) 
       barchartdata = barchartdata.reshape(2, len(barchartdata)/2) 
       print barchartdata 
       x_pos = np.arange(barchartdata.size/2) 
       print x_pos 
       print barchartdata[0,:] 
       plt.bar(x_pos + 5 * i, barchartdata[0,:]) 

     plt.show() 
     plt.xticks(x_pos, ('g1','g2','g3','g4')) 
     plt.yticks('Count') 
     plt.show() 

x1 = 'GXXXXGK' 
x2 = 'DXXG' 
x3 = 'NKXD' 
x4 = 'EXSAX' 
#input sequence 
protein = 'MAKGEFIRTKPHVNVGTIGHVDHGKTTLTAALTYVAAAENPNVEVKDYGEIDKAPEERARGITINTAHVEYETAKRHYSHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPYIVVFMNKVDMVDDPELLDLVEMEVRDLLNQYEFPGDEVPVIRGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDEYIPTPVRDVDKPFLMPVEDVFTITGRGTVATGRIERGKVKVGDEVEIVGLAPETRKTVVTGVEMHRKTLQEGIAGDNVGVLLRGVSREEVERGQVLAKPGSITPHTKFEASVYVLKKEEGGRHTGFFSGYRPQFYFRTTDVTGVVQLPPGVEMVMPGDNVTFTVELIKPVALEEGLRFAIREGGRTVGAGVVTKILE' 
H(protein,x1,x2,x3,x4) 

編輯

前面的輸出(我的劇本) - 正確:

('GAGGVGK', 9) ('DILD', 53) ('NKCD', 115) ('ETSAK', 142) 
('GAGGVGK', 9) ('DTAG', 56) ('NKCD', 115) ('ETSAK', 142) 
('GAGGVGK', 9) ('DQYM', 68) ('NKCD', 115) ('ETSAK', 142) 
('GAGGVGK', 9) ('MRTG', 71) ('NKCD', 115) ('ETSAK', 142) 
('GAGGVGK', 9) ('TGEG', 73) ('NKCD', 115) ('ETSAK', 142) 

獲得輸出在你的腳本:

((17, 'GAGGVGK'), (61, 'DILD'), (123, 'NKCD'), (150, 'ETSAK')) 
((17, 'GAGGVGK'), (64, 'DTAG'), (123, 'NKCD'), (150, 'ETSAK')) 
((17, 'GAGGVGK'), (76, 'DQYM'), (123, 'NKCD'), (150, 'ETSAK')) 
((17, 'GAGGVGK'), (79, 'MRTG'), (123, 'NKCD'), (150, 'ETSAK')) 
((17, 'GAGGVGK'), (81, 'TGEG'), (123, 'NKCD'), (150, 'ETSAK')) 

這是長度不正確。 我需要運行多個序列,但它只能運行一個序列。請指導我

我也試圖繪製一個圖,但它不能得到預期的輸出。

預計圖表:

在這個形象是預期的圖形 - 我們需要明智計算百分比列 - 請查看「以前的輸出(我的劇本) - 正確:請檢查下面形象的比方。

Expected graph image

編輯1

輸入文件是一個CSV文件格式如下(多行):

PDB ID Macromolecule Name Sequence Source 
121P H-RAS P21 PROTEIN MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH Homo sapiens 
1A12 REGULATOR OF CHROMOSOME CONDENSATION 1 RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENVMERKKPALVSIPEDVVQAEAGGMHTVCLSKSGQVYSFGCNDEGALGRDTSVEGSEMVPGKVELQEKVVQVSAGDSHTAALTDDGRVFLWGSFRDNNGVIGLLEPMKKSMVPVQVQLDVPVVKVASGNDHLVMLTADGDLYTLGCGEQGQLGRVPELFANRGGRQGLERLLVPKCVMLKSRGSRGHVRFQDAFCGAYFTFAISHEGHVYGFGLSNYHQLGTPGTESCFIPQNLTSFKNSTKSWVGFSGGQHHTVCMDSEGKAYSLGRAEYGRLGLGEGAEEKSIPTLISRLPAVSSVACGASVGYAVTKDGRVFAWGMGTNYQLGTGQDEDAWSPVEMMGKQLENRVVLSVSSGGQHTVLLVKDKEQS Homo sapiens 
1A2B TRANSFORMING PROTEIN RHOA SMAAIRKKLVIVGDVACGKTCLLIVFSKDQFPEVYVPTVFENYVADIEVDGKQVELALWDTAGQEDYDRLRPLSYPDTDVILMCFSIDSPDSLENIPEKWTPEVKHFCPNVPIILVGNKKDLRNDEHTRRELAKMKQEPVKPEEGRDMANRIGAFGYMECSAKTKDGVREVFEMATRAALQA Homo sapiens 
1A2K NUCLEAR TRANSPORT FACTOR 2 MGDKPIWEQIGSSFIQHYYQLFDNDRTQLGAIYIDASCLTWEGQQFQGKAAIVEKLSSLPFQKIQHSITAQDHQPTPDSCIISMVVGQLKADEDPIMGFHQMFLLKNINDAWVCTNDMFRLALHNFG Rattus norvegicus 

回答

1

你的代碼是非常難以處理。我已經重寫它來清理它一點。我已經刪除了所有關於我選擇改變的背景信息以及原因,因爲它們太多了。但是,這裏有一些我固定的東西:

  • 你在寫作風格unpythonic循環的趨勢

    for i in range(len(x)): 
        do_something_with(x[i]) 
    

    在Python是更好的寫

    for element in x: 
        do_something_with(element) 
    
  • 您在if附近有許多多餘的括號 - 條件

  • 您已經重寫了基本相同的大for四次循環,而不是編寫包含循環的函數並重用它。

不幸的是它打破由40 S IN的四重嵌套for -loop建立的模式的20,取得了itertools.product多一點麻煩,不值得用一個循環替換它。

無論如何,這裏是清理後的代碼版本。

import csv 
import matplotlib.pyplot as plt 
import numpy as np 

def match(X,Y): 
     mismatch = 0 
     for x,y in zip(X,Y): 
       if not (x == 'X' or x == y): 
         mismatch += 1 
         if mismatch > 1: 
          return False 
     return True 

def H(protein,x1,x2,x3,x4): 

     def find_matches(x): 
      match_positions = [] 
      matches   = [] 
      for i in range(len(protein) - len(x)): 
       candidate = protein[i : i + len(x)] 
       if match(x, candidate): 
        match_positions.append(i) 
        matches  .append(candidate) 
      return matches, match_positions 

     L1, pL1 = find_matches(x1) 
     L2, pL2 = find_matches(x2) 
     L3, pL3 = find_matches(x3) 
     L4, pL4 = find_matches(x4) 

     candidates = [] 
     for a in zip(pL1, L1): 
      for b in zip(pL2, L2): 
       for c in zip(pL3, L3): 
        for d in zip(pL4, L4): 
         if (40 <= b[0] - a[0] <= 80 and 
          40 <= c[0] - b[0] <= 80 and 
          20 <= d[0] - c[0] <= 80 ): 
          print(a,b,c,d) 
          candidates.append((a,b,c,d)) 

     for i in np.arange((np.array(candidates).transpose()).shape[1]): 
       barchartdata = np.unique(np.array(candidates).transpose()[:,i]) 
       barchartdata = barchartdata.reshape(2, len(barchartdata)//2) 
       print (barchartdata) 
       x_pos = np.arange(barchartdata.size/2) 
       print (x_pos) 
       print (barchartdata[0,:]) 
       plt.bar(x_pos + 5 * i, barchartdata[0,:]) 

     plt.show() 
     plt.xticks(x_pos, ('g1','g2','g3','g4')) 
     plt.yticks('Count') 
     plt.show() 

x1 = 'GXXXXGK' 
x2 = 'DXXG' 
x3 = 'NKXD' 
x4 = 'EXSAX' 

protein = 'MAKGEFIRTKPHVNVGTIGHVDHGKTTLTAALTYVAAAENPNVEVKDYGEIDKAPEERARGITINTAHVEYETAKRHYSHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPYIVVFMNKVDMVDDPELLDLVEMEVRDLLNQYEFPGDEVPVIRGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDEYIPTPVRDVDKPFLMPVEDVFTITGRGTVATGRIERGKVKVGDEVEIVGLAPETRKTVVTGVEMHRKTLQEGIAGDNVGVLLRGVSREEVERGQVLAKPGSITPHTKFEASVYVLKKEEGGRHTGFFSGYRPQFYFRTTDVTGVVQLPPGVEMVMPGDNVTFTVELIKPVALEEGLRFAIREGGRTVGAGVVTKILE' 
H(protein,x1,x2,x3,x4) 

當我執行這個上面的代碼中,我得到的輸出

((3, 'GEFIRTK'), (57, 'RARG'), (136, 'NKVD'), (172, 'RGSAL')) 
((3, 'GEFIRTK'), (81, 'DCPG'), (136, 'NKVD'), (172, 'RGSAL')) 
((18, 'GHVDHGK'), (81, 'DCPG'), (136, 'NKVD'), (172, 'RGSAL')) 
((18, 'GHVDHGK'), (87, 'DYIK'), (136, 'NKVD'), (172, 'RGSAL')) 
((18, 'GHVDHGK'), (92, 'MITG'), (136, 'NKVD'), (172, 'RGSAL')) 

當我執行這可見在你的問題在寫作時的代碼,我得到的輸出

('GEFIRTK', 3) ('RARG', 57) ('NKVD', 136) ('RGSAL', 172) 
('GEFIRTK', 3) ('DCPG', 81) ('NKVD', 136) ('RGSAL', 172) 
('GHVDHGK', 18) ('DCPG', 81) ('NKVD', 136) ('RGSAL', 172) 
('GHVDHGK', 18) ('DYIK', 87) ('NKVD', 136) ('RGSAL', 172) 
('GHVDHGK', 18) ('MITG', 92) ('NKVD', 136) ('RGSAL', 172) 

你是否在茫然中確定你跑了這兩個代碼sa我的數據?請注意,您的代碼正在使用硬編碼到代碼中的protein,而我的代碼,它使用樣本CSV文件內容中的數據。前者以'MAKG'開頭,後者以'MTEY'開頭。這些數據不是一樣的,所以你會期望他們給出不同的結果!

注意,我在上面所示的代碼的已編輯版本回復到使用所述硬連線protein它的代碼本身,及給出相同的結果到您

下面是一個代碼示例,示出了如何從CSV文件中讀取您的蛋白質數據:

with open('xxx.csv') as infile: 
    lines = csv.reader(infile, delimiter=' ', skipinitialspace=True) 
    next(lines) # skip header 
    for line in lines: 
     protein = line[4] 
     # Do whatever you want with protein here ... 
+0

謝謝,你能否建議我如何添加CSV文件作爲輸入序列?在輸入文件中沒有蛋白質序列。我想運行所有序列 – vishnu

+0

@vishnu我編輯了答案,以包括我猜你的問題可能意味着什麼的答案。祝你好運。 – jacg

+0

非常感謝,但沒有得到正確的結果。我已編輯我的問題,請檢查一次 – vishnu

相關問題