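I would like to modify the following program so that it reads its input from a file and produces the corresponding output:
The first line contains the name of the protein and the number of output lines that follow for this protein (say N).
Each of the next N lines contains one match: the positions of the G boxes and the actual matched strings (remember that perturbations and X, i.e. wildcards, are allowed).
Script: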
import csv
import matplotlib.pyplot as plt
import numpy as np
# all G boxes
def match(x, y):
    # Count mismatches between pattern x and window y; 'X' in x is a wildcard.
    mismatch = 0
    for i in range(len(x)):
        if x[i] == 'X' or x[i] == y[i]:
            pass
        else:
            mismatch += 1
    # At most one perturbation (mismatch) is allowed.
    return mismatch <= 1
def H(protein, x1, x2, x3, x4):
    # pLn holds the start positions and Ln the matched substrings for box n.
    pL1, pL2, pL3, pL4 = [], [], [], []
    L1, L2, L3, L4 = [], [], [], []
    # "+ 1" so that the last window of the sequence is tested as well.
    for i in range(len(protein) - len(x1) + 1):
        if match(x1, protein[i:i + len(x1)]):
            pL1.append(i)
            L1.append(protein[i:i + len(x1)])
    for j in range(len(protein) - len(x2) + 1):
        if match(x2, protein[j:j + len(x2)]):
            pL2.append(j)
            L2.append(protein[j:j + len(x2)])
    for k in range(len(protein) - len(x3) + 1):
        if match(x3, protein[k:k + len(x3)]):
            pL3.append(k)
            L3.append(protein[k:k + len(x3)])
    for l in range(len(protein) - len(x4) + 1):
        if match(x4, protein[l:l + len(x4)]):
            pL4.append(l)
            L4.append(protein[l:l + len(x4)])
    # Keep only combinations whose boxes are spaced within the allowed ranges.
    candidates = []
    for i in range(len(pL1)):
        for j in range(len(pL2)):
            for k in range(len(pL3)):
                for l in range(len(pL4)):
                    if (40 <= pL2[j] - pL1[i] <= 80 and
                            40 <= pL3[k] - pL2[j] <= 80 and
                            20 <= pL4[l] - pL3[k] <= 40):
                        a = L1[i], pL1[i]
                        b = L2[j], pL2[j]
                        c = L3[k], pL3[k]
                        d = L4[l], pL4[l]
                        print(a, b, c, d)
                        candidates.append((a, b, c, d))
    # Plot attempt: one group of bars per G box, one bar per distinct match
    # position. The bar heights are the positions themselves.
    offset = 5
    for i in range(4):
        positions = sorted(set(cand[i][1] for cand in candidates))
        x_pos = np.arange(len(positions))
        plt.bar(x_pos + offset * i, positions)
    plt.xticks([offset * i for i in range(4)], ('g1', 'g2', 'g3', 'g4'))
    # plt.yticks expects tick locations, so the label goes through plt.ylabel.
    plt.ylabel('Count')
    plt.show()
x1 = 'GXXXXGK'
x2 = 'DXXG'
x3 = 'NKXD'
x4 = 'EXSAX'
#input sequence
protein = 'MAKGEFIRTKPHVNVGTIGHVDHGKTTLTAALTYVAAAENPNVEVKDYGEIDKAPEERARGITINTAHVEYETAKRHYSHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPYIVVFMNKVDMVDDPELLDLVEMEVRDLLNQYEFPGDEVPVIRGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDEYIPTPVRDVDKPFLMPVEDVFTITGRGTVATGRIERGKVKVGDEVEIVGLAPETRKTVVTGVEMHRKTLQEGIAGDNVGVLLRGVSREEVERGQVLAKPGSITPHTKFEASVYVLKKEEGGRHTGFFSGYRPQFYFRTTDVTGVVQLPPGVEMVMPGDNVTFTVELIKPVALEEGLRFAIREGGRTVGAGVVTKILE'
H(protein,x1,x2,x3,x4)
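The output format described at the top (a header line with the protein name and N, then N match lines) could be produced along these lines. This is only a sketch: `format_report` is a hypothetical helper, it assumes the list of candidates is available in the same shape the script builds (4-tuples of `(matched_substring, position)` pairs), and the `position:match` layout of each line is an assumption, since the exact layout is not specified.

```python
def format_report(name, candidates):
    """Build the report text for one protein.

    candidates is a list of 4-tuples of (matched_substring, position) pairs,
    the same shape the script appends to its `candidates` list.
    """
    # Header line: protein name followed by the number of match lines (N).
    lines = ["%s %d" % (name, len(candidates))]
    for cand in candidates:
        # Assumed layout: "position:match" for each of the four G boxes.
        lines.append(" ".join("%d:%s" % (pos, seq) for seq, pos in cand))
    return "\n".join(lines)


if __name__ == "__main__":
    example = [(("GAGGVGK", 9), ("DILD", 53), ("NKCD", 115), ("ETSAK", 142))]
    print(format_report("121P", example))
```

To use it with the script above, H would have to return its `candidates` list instead of only printing each combination.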
Edit
Previous output (my script), correct:
('GAGGVGK', 9) ('DILD', 53) ('NKCD', 115) ('ETSAK', 142)
('GAGGVGK', 9) ('DTAG', 56) ('NKCD', 115) ('ETSAK', 142)
('GAGGVGK', 9) ('DQYM', 68) ('NKCD', 115) ('ETSAK', 142)
('GAGGVGK', 9) ('MRTG', 71) ('NKCD', 115) ('ETSAK', 142)
('GAGGVGK', 9) ('TGEG', 73) ('NKCD', 115) ('ETSAK', 142)
Output obtained with your script:
((17, 'GAGGVGK'), (61, 'DILD'), (123, 'NKCD'), (150, 'ETSAK'))
((17, 'GAGGVGK'), (64, 'DTAG'), (123, 'NKCD'), (150, 'ETSAK'))
((17, 'GAGGVGK'), (76, 'DQYM'), (123, 'NKCD'), (150, 'ETSAK'))
((17, 'GAGGVGK'), (79, 'MRTG'), (123, 'NKCD'), (150, 'ETSAK'))
((17, 'GAGGVGK'), (81, 'TGEG'), (123, 'NKCD'), (150, 'ETSAK'))
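The positions here are incorrect. I need to run multiple sequences, but it only runs one sequence. Please guide me.
I also tried to draw a plot, but it does not produce the expected output.
Expected graph:
This image shows the expected graph. We need to compute the percentages column-wise; please see "Previous output (my script), correct:" above and refer to the image below for an example.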
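One possible reading of "percentages column-wise" is: for each of the four G-box columns in the output, what share of the candidate lines contains each distinct match. The sketch below computes that under this assumption. `column_percentages` is a hypothetical helper, and the input is assumed to be a list shaped like the script's `candidates` (4-tuples of `(matched_substring, position)` pairs).

```python
from collections import Counter


def column_percentages(candidates):
    """For each of the four G-box columns, map every distinct match to the
    percentage of candidate lines in which it appears."""
    result = []
    for col in range(4):
        counts = Counter(cand[col] for cand in candidates)
        total = sum(counts.values())
        # An empty candidate list yields an empty dict, so no division by zero.
        result.append({key: 100.0 * n / total for key, n in counts.items()})
    return result


if __name__ == "__main__":
    g1, g3, g4 = ("GAGGVGK", 9), ("NKCD", 115), ("ETSAK", 142)
    g2_hits = [("DILD", 53), ("DTAG", 56), ("DQYM", 68), ("MRTG", 71), ("TGEG", 73)]
    candidates = [(g1, g2, g3, g4) for g2 in g2_hits]
    for box, shares in zip(("g1", "g2", "g3", "g4"), column_percentages(candidates)):
        print(box, shares)
```

The resulting dictionaries could then be passed to `plt.bar`, one group of bars per box.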
Edit 1
The input file is a CSV file in the following format (multiple rows):
PDB ID,Macromolecule Name,Sequence,Source
121P,H-RAS P21 PROTEIN,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH,Homo sapiens
1A12,REGULATOR OF CHROMOSOME CONDENSATION 1,RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENVMERKKPALVSIPEDVVQAEAGGMHTVCLSKSGQVYSFGCNDEGALGRDTSVEGSEMVPGKVELQEKVVQVSAGDSHTAALTDDGRVFLWGSFRDNNGVIGLLEPMKKSMVPVQVQLDVPVVKVASGNDHLVMLTADGDLYTLGCGEQGQLGRVPELFANRGGRQGLERLLVPKCVMLKSRGSRGHVRFQDAFCGAYFTFAISHEGHVYGFGLSNYHQLGTPGTESCFIPQNLTSFKNSTKSWVGFSGGQHHTVCMDSEGKAYSLGRAEYGRLGLGEGAEEKSIPTLISRLPAVSSVACGASVGYAVTKDGRVFAWGMGTNYQLGTGQDEDAWSPVEMMGKQLENRVVLSVSSGGQHTVLLVKDKEQS,Homo sapiens
1A2B,TRANSFORMING PROTEIN RHOA,SMAAIRKKLVIVGDVACGKTCLLIVFSKDQFPEVYVPTVFENYVADIEVDGKQVELALWDTAGQEDYDRLRPLSYPDTDVILMCFSIDSPDSLENIPEKWTPEVKHFCPNVPIILVGNKKDLRNDEHTRRELAKMKQEPVKPEEGRDMANRIGAFGYMECSAKTKDGVREVFEMATRAALQA,Homo sapiens
1A2K,NUCLEAR TRANSPORT FACTOR 2,MGDKPIWEQIGSSFIQHYYQLFDNDRTQLGAIYIDASCLTWEGQQFQGKAAIVEKLSSLPFQKIQHSITAQDHQPTPDSCIISMVVGQLKADEDPIMGFHQMFLLKNINDAWVCTNDMFRLALHNFG,Rattus norvegicus
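Reading every sequence from such a file could look roughly like the sketch below. It assumes the file is comma-delimited and has a header row with the four column names shown above. `read_sequences` is a hypothetical helper, and the two sample rows are made-up toy data rather than real entries.

```python
import csv
import io


def read_sequences(handle, delimiter=","):
    """Yield (pdb_id, sequence) for every data row of the CSV input.

    Assumes a header row containing the columns "PDB ID" and "Sequence".
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        yield row["PDB ID"], row["Sequence"].strip()


if __name__ == "__main__":
    # Toy stand-in for the real file; with a real file one would use
    # open(path, newline="") instead of io.StringIO.
    sample = io.StringIO(
        "PDB ID,Macromolecule Name,Sequence,Source\n"
        "T1,TOY PROTEIN ONE,MAKGEF,none\n"
        "T2,TOY PROTEIN TWO,GAGGVGK,none\n"
    )
    for pdb_id, sequence in read_sequences(sample):
        print(pdb_id, len(sequence))
```

Each `(pdb_id, sequence)` pair can then be passed to `H(sequence, x1, x2, x3, x4)` in turn, so that all sequences are processed and not only one. Passing the sequence field alone, without the ID or name, keeps the reported positions relative to the start of the sequence.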
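Thanks. Could you suggest how I can add a CSV file as the input for the sequences? In the input file there is no protein sequence. I want to run all the sequences – vishnu
@vishnu I edited the answer to include an answer to what I guess your question might mean. Good luck. – jacg
Thank you very much, but I am not getting the correct result. I have edited my question, please take a look – vishnu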