Okay, so I used line_profiler to profile your code:
    from random import randrange

    @profile
    def scoreMotifs(motifs):
        '''This function computes the score of list of motifs'''
        z = []
        for i in range(len(motifs[0])):
            y = ''
            for j in range(len(motifs)):
                y += motifs[j][i]
            z.append(y)
        totalscore = 0
        for string in z:
            score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
            totalscore += score
        return totalscore

    def random_seq():
        dna_mapping = ['T', 'A', 'C', 'G']
        return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

    motifs = [random_seq() for _ in range(1000000)]
    print(scoreMotifs(motifs))
These are the results:
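    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         3                                           @profile
         4                                           def scoreMotifs(motifs):
         5                                               '''This function computes the score of list of motifs'''
         6         1            4      4.0      0.0      z = []
         7         4           14      3.5      0.0      for i in range(len(motifs[0])):
         8         3            2      0.7      0.0          y = ''
         9   3000003      1502627      0.5     41.7          for j in range(len(motifs)):
        10   3000000      2075204      0.7     57.5              y += motifs[j][i]
        11         3           22      7.3      0.0          z.append(y)
        12         1            1      1.0      0.0      totalscore = 0
        13         4            4      1.0      0.0      for string in z:
        14         3        29489   9829.7      0.8          score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        15         3            5      1.7      0.0          totalscore += score
        16         1            1      1.0      0.0      return totalscore

    Total time: 3.60737 s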
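Almost all of the time (over 99%) is spent in the inner loop that builds each column one character at a time:

    y += motifs[j][i]

There is a much better way to transpose your strings, though: the zip trick. So you can rewrite your code as: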
    from random import randrange

    @profile
    def scoreMotifs(motifs):
        '''This function computes the score of list of motifs'''
        # zip(*motifs) transposes the motifs: each item is one column,
        # given as a tuple of characters
        z = zip(*motifs)
        totalscore = 0
        for string in z:
            score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
            totalscore += score
        return totalscore

    def random_seq():
        dna_mapping = ['T', 'A', 'C', 'G']
        return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

    motifs = [random_seq() for _ in range(1000000)]
    print(scoreMotifs(motifs))

    motifs = ['GCG','AAG','AAG','ACG','CAA']
    print(scoreMotifs(motifs))
The total time is now:

    Total time: 0.61699 s

Down from 3.60737 s, that is nearly six times faster. I'd say that's a pretty good improvement.
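To make the zip trick concrete, here is a minimal sketch that applies it to the small five-motif example. The helper names `transpose` and `column_score` are made up for this illustration and are not part of your code; the sketch assumes all motifs have the same length and contain only A, C, G and T.

```python
def transpose(motifs):
    """Hypothetical helper: return the columns of the motifs as strings."""
    # zip(*motifs) yields one tuple of characters per column.
    return [''.join(column) for column in zip(*motifs)]


def column_score(column):
    """Hypothetical helper: count the characters that differ from the most common base."""
    return len(column) - max(column.count(base) for base in 'ACGT')


motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']
columns = transpose(motifs)
print(columns)                                # ['GAAAC', 'CAACA', 'GGGGA']
print([column_score(c) for c in columns])     # [2, 2, 1]
print(sum(column_score(c) for c in columns))  # 5
```

Each column keeps its most common base and is charged for every other character, so the three columns contribute 2, 2 and 1 for a total score of 5.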
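It's not really clear what your program does. The picture is a bit ambiguous. My best guess is: for each column (in your picture), it takes the count of the most common element and subtracts it from the total number of elements in that column. Then it sums those numbers up? – Dair
Yes, you are right. –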