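I am writing a long piece of code that takes too long to execute. I ran cProfile on it and found that the following function is called 150 times and takes 1.3 seconds per call, so this function alone accounts for roughly 200 seconds. The function is shown below. Can it be optimized for speed?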
def makeGsList(sentences,org):
    gs_list1=[]
    gs_list2=[]
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org=='m':
                gs_list1 = [k for k in m_words if k in s]
            if org=='h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3
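The function takes a list of sentences and a flag org. It goes through each line, checks whether it starts with any of the words in the list StartWords, and if so lowercases it. Then, depending on the value of org, it builds a list of all entries of m_words or h_words that occur in the current sentence. It keeps appending these entries to another list, gs_list2. Finally it turns gs_list2 into a set, which removes duplicates, and returns the result as a list.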
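To make the description above concrete, here is a minimal, self-contained restatement of the same logic. It is only a sketch: the name `make_gs_list` and the explicit `words` and `start_words` parameters are hypothetical, introduced so the block runs without the globals, and the miniature data is made up in the shape of the examples in this question. It builds the prefix tuple once and accumulates directly into a set.

```python
def make_gs_list(sentences, words, start_words):
    """Return the unique entries of `words` that occur as substrings of
    the lowercased sentences starting with one of `start_words`."""
    prefixes = tuple(start_words)  # built once, not once per sentence
    found = set()                  # accumulate into a set directly
    for s in sentences:
        if s.startswith(prefixes):
            lowered = s.lower()
            # Same substring test as the original: `k in s`
            found.update(w for w in words if w in lowered)
    return list(found)


# Hypothetical miniature data (not the real 250,000-entry lists).
start_words = ['!Series_title', '!Series_summary']
sentences = [
    '!Series_title\t"Transcript profiles show abnormalities in actin bundling"\n',
    '!Series_type\t"Expression profiling by array; actin bundling"\n',
]
words = ['actin bundling', 'gde2', 'PLOSL patients']
result = make_gs_list(sentences, words, start_words)
```

This still performs one substring test per entry of the word list for every matching sentence, so I would not expect it to change the dominant cost by itself.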
Can anyone give me suggestions on how to optimize this function to reduce its execution time?
Note: the entries in h_words/m_words are not all single words; many of them are phrases of 3-4 words.
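Because the entries can be multi-word phrases, splitting a sentence into single words and intersecting with a set would miss them. One direction that might work is to look up every run of 1 to N consecutive words in a set of lowercased phrases. The sketch below is an assumption, not a drop-in replacement: `ngram_matches`, `max_words`, and the regex tokenization are hypothetical choices, and it matches on word boundaries rather than arbitrary substrings, so its results can differ from `k in s`.

```python
import re


def ngram_matches(sentence, phrase_set, max_words=4):
    """Return the phrases in `phrase_set` that appear in `sentence` as a
    run of 1..max_words consecutive tokens (word-boundary matching)."""
    # Assumed tokenization: runs of word characters and hyphens.
    tokens = re.findall(r"[\w-]+", sentence.lower())
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrase_set:  # average O(1) set lookup
                found.add(candidate)
    return found


# Phrases are stored lowercased and single-spaced (an assumption).
phrases = {'actin bundling', 'gde2', 'plosl patients', 'pp1665'}
line = ('!Series_title\t"Transcript profiles of DCs of PLOSL patients show '
        'abnormalities in pathways of actin bundling and immune response"\n')
matches = ngram_matches(line, phrases)
```

With this approach the work per sentence depends on the sentence length and `max_words` rather than on the size of the word list. `max_words` would have to cover the longest phrase in the list. The phrases also need to be lowercased up front: an entry such as 'PLOSL patients' can never match a lowercased sentence as written.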
Some examples:
StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']
sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n', u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']
h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']
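m_words is similar.
About the sizes:
Both lists, h_words and m_words, contain about 250,000 entries. Each entry is on average 2 words long. The list of sentences is about 10-20 sentences long; I have provided an example list above to give you an idea of how large each sentence is.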
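To put these sizes in perspective, here is the back-of-the-envelope arithmetic, using only the figures quoted in this question. The check counts are an upper bound, since only sentences that start with one of the StartWords are scanned.

```python
# Profile figures from the question.
calls = 150
seconds_per_call = 1.3
total_seconds = calls * seconds_per_call  # about 195 s, i.e. roughly 200 s

# Substring checks per call: one `k in s` test per word-list entry
# for every sentence that passes the StartWords filter.
word_list_size = 250000
sentences_low, sentences_high = 10, 20
checks_low = word_list_size * sentences_low    # 2,500,000
checks_high = word_list_size * sentences_high  # 5,000,000
```

So each call performs up to a few million substring searches, which seems consistent with it taking on the order of a second.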
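This belongs on Code Review Stack Exchange if your code is working fine –
1. Think about what is in 'gs_list1' after each iteration. 2. Why not start with a *set*? – jonrsharpe
Can you guarantee that 'org' will be 'm' or 'h'? –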