Python printing is extremely slow

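I've run into a problem I have never encountered before, and it is very frustrating. I'm using rpy2 to interface with R from within a Python script and normalize an array. For some reason, when I piece my output together and print it to a file, it takes ages to print. It also slows down as it proceeds, until it is writing maybe a few kB of data per minute.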

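My input file is large (366 MB), but this is running on a high-performance computing cluster with near-unlimited resources. It should have no trouble with this.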

Here is where I actually do the normalization:

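# Imports assumed here; the question does not show them.
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
preprocessCore = importr("preprocessCore")

matrix = sample_list # two-dimensional array
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
print("Performing quantile normalization.")
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)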

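As you can see, I end up with a numpy.array object containing my now-normalized data. I also have another list of strings that I want to write to the output, with each element corresponding to one row of the numpy array. So I iterate, join each row of the array into a string, and print it to the output.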

for thing in pos_list: # List of strings corresponding with each row of array. 
    thing_index = pos_list.index(thing) 

    norm_data = normalized_matrix[thing_index] 
    out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data) 

    print(thing + "\t" + out_data, file=output)

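I'm no pro, but I have no idea why things slow down so much. Any insight or suggestions would be very much appreciated. I can post more of the script, or the rest of it, if anyone thinks it would help.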

Update: Thanks to @lgautier for the profiling suggestion. Using the line_profiler module, I was able to pinpoint my problem to this line: thing_index = pos_list.index(thing)

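This makes sense, since the list is very long, and it also explains why the script slows down as it proceeds. Simply using a counter instead fixed the problem.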
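The reason the lookup is so costly: list.index scans from the start of the list on every call, so the loop as a whole does work proportional to the square of the list length. It also returns the first match, so a duplicated label would silently reuse the first row. A minimal sketch of the same fix using zip instead of a manual counter follows; write_rows and the sample data are hypothetical stand-ins, and plain lists are used in place of the NumPy array from the question.

```python
import io


def write_rows(pos_list, normalized_matrix, output):
    """Write one tab-separated line per label, pairing label i with row i."""
    # zip walks both sequences in step, so no per-row search is needed.
    for thing, norm_data in zip(pos_list, normalized_matrix):
        out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
        print(thing + "\t" + out_data, file=output)


# Hypothetical stand-in data; note the duplicated label in pos_list.
pos_list = ["chr1:100", "chr1:200", "chr1:100"]
normalized_matrix = [[1.0, 2.5], [3.25, 4.0], [5.0, 6.5]]

buffer = io.StringIO()  # stands in for the open output file
write_rows(pos_list, normalized_matrix, buffer)
result = buffer.getvalue()
```

With pos_list.index(thing), the third line would have repeated the values of the first row; pairing by position keeps each label with its own row.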

Profile of the original code (note the % Time for the line identified above):

Line #    Hits        Time     Per Hit  % Time  Line Contents
   115       1    16445761  16445761.0    15.5  header, pos_list, normalized_matrix = Quantile_Normalize(in
   117       1          54        54.0     0.0  print("Creating output file...")
   120       1        1450      1450.0     0.0  output = open(output_file, "w")
   122       1           8         8.0     0.0  print(header, file=output)
   124                                          # Iterate through each position and print QN'd data
   125  100000       74600         0.7     0.1  for thing in pos_list:
   126   99999    85244758       852.5    80.3      thing_index = pos_list.index(thing)
   129   99999      158741         1.6     0.1      norm_data = normalized_matrix[thing_index]
   130   99999     3801631        38.0     3.6      out_data = "\t".join("{0:.2f}".format(piece) for pi
   132   99999      384248         3.8     0.4      print(thing + "\t" + out_data, file=output)
   134       1        3641      3641.0     0.0  output.close()

Profile of the new code:

Line #    Hits        Time     Per Hit  % Time  Line Contents
   115       1    16177130  16177130.0    82.5  header, pos_list, normalized_matrix = Quantile_Normalize(input_file, data_start)
   116
   117       1          55        55.0     0.0  print("Creating output file...")
   118
   119
   120       1       26157     26157.0     0.1  output = open(output_file, "w")
   121
   122       1          11        11.0     0.0  print(header, file=output)
   123
   124                                          # Iterate through each position and print QN'd data
   125       1           1         1.0     0.0  count = 0
   126  100000       62709         0.6     0.3  for thing in pos_list:
   127   99999       58587         0.6     0.3      thing_index = count
   128   99999       67164         0.7     0.3      count += 1
   131   99999       85664         0.9     0.4      norm_data = normalized_matrix[thing_index]
   132   99999     2877634        28.8    14.7      out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
   134   99999      240654         2.4     1.2      print(thing + "\t" + out_data, file=output)
   136       1        1713      1713.0     0.0  output.close()

Source

2016-06-10 Jared Andrews

Does pos_list contain R objects? I don't use rpy2 often, but in my experience the interaction between the two languages is rather slow.

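Which part is actually slow? Try commenting out bits and pieces to see what makes it faster. I would expect '[ element for col in matrix for element in col ]' to be slow.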

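No, pos_list contains only strings. The '[ element for col in matrix for element in col ]' list comprehension is slow, but it runs before anything is actually written to the file, so it is not the bottleneck here.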

If I understand this correctly, everything runs fine and with good performance up to (and including) the line:

normalized_matrix = np.array(Rnormalized_matrix)

On that line the resulting matrix is turned into a numpy array (literally; it can be faster when copying the data is avoided, as in http://rpy2.readthedocs.io/en/version_2.8.x/numpy.html?from-rpy2-to-numpy).

I can't see any rpy2-related performance issue in the rest of the script.

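Now, what may be happening is this: just because it says "HPC" on the label does not mean it is high-performance in every situation for all code. Have you considered running that last loop through a code profiler? It would tell you where the time is spent.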
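As a rough illustration of that suggestion, here is a minimal sketch using the standard-library cProfile module. This is a swapped-in tool, not the line_profiler module mentioned elsewhere in the thread: cProfile reports time per function call rather than per line. The slow_loop function and the labels list are hypothetical stand-ins for the output loop and pos_list.

```python
import cProfile
import io
import pstats


def slow_loop(pos_list):
    """Mimic the output loop's lookup pattern: one list.index call per item."""
    total = 0
    for thing in pos_list:
        total += pos_list.index(thing)  # linear scan from the front each time
    return total


# Hypothetical stand-in for the real list of row labels.
labels = ["pos{0}".format(i) for i in range(2000)]

profiler = cProfile.Profile()
profiler.enable()
slow_loop(labels)
profiler.disable()

# Collect the report as text, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, the row for the list's index method shows how many times it was called and how much time those calls took in total, which is the kind of evidence that points at a single hot spot.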

Source

2016-06-10 11:34:19 lgautier

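So, using the 'line_profiler' module, I was able to determine that ~80% of the time was spent on this line: 'thing_index = pos_list.index(thing)', which I guess should have been obvious, since the file is millions of lines long and the search gets slower the further into the list it has to go. I replaced it with a counter and saw a massive speedup. It was a huge oversight on my part, but since your answer put me on the right track, I'm marking it as the answer. I'll update my question to reflect the profiling and the changes.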

One thing: I usually use a generator to avoid building a temporary list of many small strings.

out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)

But it is hard to say whether this part is the slow one.
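For comparison, a small sketch of the two forms side by side; norm_data here is a hypothetical row of values. Both produce the same string. As far as I know, CPython's str.join collects the items into a sequence internally before joining, so any difference in speed or memory is likely to be small and worth measuring rather than assuming.

```python
norm_data = [0.5, 1.25, 10.0]  # hypothetical row of normalized values

# List comprehension: builds the full list of strings, then joins it.
from_list = "\t".join(["{0:.2f}".format(piece) for piece in norm_data])

# Generator expression: no explicit temporary list in the calling code.
from_generator = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
```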

Source

2016-06-10 11:09:19

Thanks, that's a good tip. I've updated my code with it, although it didn't solve the problem.
