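Thread pool in Python is not as fast as expected

I am a beginner in Python and machine learning. I am trying to reproduce the code of CountVectorizer() using multithreading. I am working with the Yelp dataset to do sentiment analysis with LogisticRegression. This is what I have written so far:

Code snippet: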

from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial

data = df['text']
rev = df['stars']


y = []


def product_helper(args):
    # Unpack one (text, rating) tuple into featureExtraction's arguments.
    return featureExtraction(*args)


def featureExtraction(p, t):
    # Build the word-count vector for one review p; the rating t is not used here.
    temp = [0] * len(bag_of_words)
    for word in p.split():
        if word in bag_of_words:
            temp[bag_of_words.index(word)] += 1

    return temp


# Map product_helper over all reviews using a pool of threads.
def calculateParallel(threads):
    pool = ThreadPool(threads)
    job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
    l = pool.map(product_helper, job_args)
    pool.close()
    pool.join()
    return l


temp_X = calculateParallel(12)

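This is only part of the code.

Explanation:

df['text'] holds all the reviews and df['stars'] holds the ratings (1 to 5). I am trying to compute the word-count vectors temp_X using multithreading. bag_of_words is a selection of frequently used words.

Question:

Without multithreading, I can compute temp_X in about 24 minutes; the code above took 33 minutes, for a dataset of 100K reviews. My machine has 128 GB of DRAM and 12 cores (6 physical cores with hyperthreading, i.e. 2 threads per core).

What am I doing wrong here?

Source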

2016-11-25 bhaskar jupudi

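Your whole code seems to be CPU bound rather than I/O bound. You are only using threads, which are subject to the GIL, so effectively a single thread runs at a time, plus overhead. It runs on only one core. To run on multiple cores, use: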

import multiprocessing 
pool = multiprocessing.Pool() 
l = pool.map_async(product_helper,job_args)

from multiprocessing.dummy import Pool as ThreadPool is just a wrapper around the threading module. For CPU-bound code like this it uses only one core and no more.

Source
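Applied to the question's code, the change could look roughly like the sketch below. The vocabulary and the reviews are made-up stand-ins for bag_of_words and df['text'], the function names are illustrative, and the rating argument is dropped because featureExtraction never uses it. The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the main module.

```python
import multiprocessing

# Hypothetical stand-ins for bag_of_words and df['text'] from the question.
BAG_OF_WORDS = ["good", "bad", "food", "service"]
REVIEWS = ["good food good service", "bad service", "food was ok"]


def feature_extraction(text):
    """Count how often each vocabulary word occurs in one review."""
    counts = [0] * len(BAG_OF_WORDS)
    for word in text.split():
        if word in BAG_OF_WORDS:
            counts[BAG_OF_WORDS.index(word)] += 1
    return counts


def calculate_parallel(texts, processes=None):
    """Map feature_extraction over texts using worker processes, not threads."""
    with multiprocessing.Pool(processes=processes) as pool:
        # pool.map blocks until every review is processed and keeps input order.
        return pool.map(feature_extraction, texts)


if __name__ == "__main__":
    print(calculate_parallel(REVIEWS, processes=2))
```

Each worker process has its own interpreter and its own GIL, so the counting can proceed on several cores at once. The arguments and results are pickled between processes, which adds some overhead of its own.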

2016-11-25 19:33:19 vks

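Thank you very much. I have one more question here. Where do we specify the number of cores to use?

@bhaskarjupudi It automatically picks up the number of available cores from multiprocessing.cpu_count(). – vks

In the snippet you posted, l is an object. How do I retrieve the actual list from that object?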
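On the two follow-up questions above, a minimal sketch: the number of workers can be passed as the `processes` argument of `multiprocessing.Pool`, and `map_async` returns an `AsyncResult` whose `get()` method blocks until the list is ready. The `square` function and its inputs are placeholders, not the asker's code.

```python
import multiprocessing


def square(x):
    """Placeholder for the real per-item work (e.g. product_helper)."""
    return x * x


def main():
    # processes=4 is an arbitrary choice; omit it to use the default worker count.
    with multiprocessing.Pool(processes=4) as pool:
        async_result = pool.map_async(square, range(10))
        # get() waits for all tasks and returns an ordinary list in input order.
        values = async_result.get()
    print(values)


if __name__ == "__main__":
    main()
```

If nothing else needs to happen while the workers run, `pool.map(...)` returns the list directly and the `get()` step is not needed.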

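Python and threads don't really work well together. There is a known issue called the GIL (Global Interpreter Lock). Basically, there is a lock in the interpreter that keeps threads from running in parallel (even if you have multiple CPU cores). Python simply gives each thread CPU time one after another (and the reason it got slower is the overhead of context switching between threads).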
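A rough way to see this on a standard CPython build (one with the GIL enabled) is to time the same pure-Python loop sequentially and through a thread pool. The function name and the workload size below are made up for illustration, and the timings will vary from machine to machine.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def busy_sum(n):
    """Pure-Python CPU-bound loop; the running thread holds the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(label, fn):
    """Run fn(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result


WORK = [200_000] * 8  # hypothetical workload: 8 identical CPU-bound tasks

sequential = timed("sequential", lambda: [busy_sum(n) for n in WORK])
with ThreadPool(4) as pool:
    threaded = timed("4 threads", lambda: pool.map(busy_sum, WORK))
```

Both runs produce the same results. With the GIL in place, the threaded run is typically no faster than the sequential one and may be somewhat slower.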

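Here is a very good document that explains how it works: http://www.dabeaz.com/python/UnderstandingGIL.pdf

To solve your problem, I suggest you try multiprocessing: https://pymotw.com/2/multiprocessing/basics.html

Note: multiprocessing is not 100% equivalent to multithreading. Multiprocessing does run in parallel, but the different processes do not share memory, so if you change a variable in one of them, it will not change in the other processes.

Source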

2016-11-25 19:31:53 DorElias
