Automatically multiprocessing a "function apply" on a dataframe column

I have a simple dataframe with two columns.

+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
+---------+-------+
| cool    | 0     |
+---------+-------+
| hey     | 0     |
+---------+-------+
| there   | 0     |
+---------+-------+
| come on | 0     |
+---------+-------+
| welcome | 0     |
+---------+-------+

For each record in the 'subject' column, I call a function and update the 'score' column with the result:

df['score'] = df['subject'].apply(find_score) 

Here find_score is a function that processes a string and returns a score:

def find_score (row): 
    # Imports the Google Cloud client library 
    from google.cloud import language 

    # Instantiates a client 
    language_client = language.Client() 

    import re 
    pre_text = re.sub('<[^>]*>', '', row) 
    text = re.sub(r'[^\w]', ' ', pre_text) 

    document = language_client.document_from_text(text) 

    # Detects the sentiment of the text 
    sentiment = document.analyze_sentiment().sentiment 

    print("Sentiment score - %f " % sentiment.score) 

    return sentiment.score

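This works as expected, but it is slow because it processes the records one by one.

Is there a way to parallelize this without manually splitting the dataframe into smaller chunks? Is there any library that does this automatically?

Cheers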

Source

2017-05-25 gnanagurus

Can you show the def of your find_score function? – Allen

Consider using dask – Boud

@Allen I have added the function def to the question – gnanagurus

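Instantiating language.Client on every call to find_score is probably a major bottleneck. You don't need a new client instance for each call, so try creating it once outside the function, before you call it: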

import re

# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client once, outside the function
language_client = language.Client()

def find_score(row):
    # Strip HTML tags, then replace non-word characters with spaces
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)

    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment

    print("Sentiment score - %f " % sentiment.score)

    return sentiment.score

df['score'] = df['subject'].apply(find_score)
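
If you want to see the effect of hoisting the client without calling the real API, here is a minimal sketch. FakeClient, score_per_call and score_shared are hypothetical names made up for illustration, and the "score" is just the text length, not a real sentiment. It only counts how many clients get constructed under each pattern.

```python
class FakeClient:
    """Hypothetical stand-in for an expensive API client; counts constructions."""
    instances = 0

    def __init__(self):
        FakeClient.instances += 1

    def score(self, text):
        # Placeholder "sentiment": just the text length, not a real model.
        return float(len(text))


shared_client = FakeClient()  # built once, outside the function


def score_shared(text):
    # Reuses the module-level client on every call.
    return shared_client.score(text)


def score_per_call(text):
    # Builds a new client for every row (the slow pattern from the question).
    return FakeClient().score(text)


subjects = ["wow", "cool", "hey", "there", "come on", "welcome"]

shared_scores = [score_shared(s) for s in subjects]
instances_after_shared = FakeClient.instances

per_call_scores = [score_per_call(s) for s in subjects]
per_call_instances = FakeClient.instances - instances_after_shared
```

Both versions return the same scores; the difference is one construction in total versus one per row, which is where the time goes if the constructor is expensive.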

If you insist, you can use multiprocessing like this:

from multiprocessing import Pool

# <Define functions and datasets here>

if __name__ == '__main__':
    # Pool size is your choice; the with-block cleans up the workers
    with Pool(processes=8) as pool:
        df['score'] = pool.map(find_score, df['subject'])
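
Since each call mostly waits on the network, a thread pool may be enough. This is a different technique from the process pool above, and whether the client can safely be shared across threads is an assumption you would need to check. Below is a rough, self-contained sketch using multiprocessing.pool.ThreadPool, which has the same map interface. fake_score and parallel_scores are hypothetical names; fake_score stands in for find_score, makes no API call, and returns a word count as a placeholder score.

```python
import re
from multiprocessing.pool import ThreadPool


def fake_score(text):
    """Hypothetical stand-in for find_score: no network call is made."""
    pre_text = re.sub(r'<[^>]*>', '', text)    # strip HTML tags, as in the question
    cleaned = re.sub(r'[^\w]', ' ', pre_text)  # replace non-word characters
    return float(len(cleaned.split()))         # placeholder score: word count


def parallel_scores(subjects, workers=4):
    # map() blocks until every item is done and preserves input order,
    # so the result lines up with the input rows.
    with ThreadPool(processes=workers) as pool:
        return pool.map(fake_score, subjects)


subjects = ["wow", "cool", "hey", "there", "come on", "welcome"]
scores = parallel_scores(subjects)
```

With a dataframe, the same helper could be used as df['score'] = parallel_scores(df['subject']), since the returned list keeps the row order.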

Source

2017-05-25 07:19:03
