2017-05-25 88 views

Automatically multiprocessing a "function apply" on a dataframe column

I have a simple dataframe with two columns.

+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
+---------+-------+
| cool    | 0     |
+---------+-------+
| hey     | 0     |
+---------+-------+
| there   | 0     |
+---------+-------+
| come on | 0     |
+---------+-------+
| welcome | 0     |
+---------+-------+

For every record in the 'subject' column, I call a function and write the result to the 'score' column:

df['score'] = df['subject'].apply(find_score) 

Here find_score is a function that processes a string and returns a score:

def find_score (row): 
    # Imports the Google Cloud client library 
    from google.cloud import language 

    # Instantiates a client 
    language_client = language.Client() 

    import re 
    pre_text = re.sub('<[^>]*>', '', row) 
    text = re.sub(r'[^\w]', ' ', pre_text) 

    document = language_client.document_from_text(text) 

    # Detects the sentiment of the text 
    sentiment = document.analyze_sentiment().sentiment 

    print("Sentiment score - %f " % sentiment.score) 

    return sentiment.score 

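This works as expected, but it is very slow because it processes the records one by one.

Is there a way to parallelize this without manually splitting the dataframe into smaller chunks? Is there a library that does this automatically?

Cheers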

Can you show the def of your find_score function? – Allen

Consider using dask – Boud

@Allen I have added the function def to the question – gnanagurus

Answers


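Instantiating language.Client every time find_score is called is likely a major bottleneck. You don't need a new client instance for each call, so try creating it once outside the function, before you call it: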

import re

# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client once, so every call to find_score reuses it
language_client = language.Client()

def find_score(row):
    # Strip HTML tags, then replace every non-word character with a space
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)

    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment

    print("Sentiment score - %f " % sentiment.score)

    return sentiment.score

df['score'] = df['subject'].apply(find_score)
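
To make the cost of per-call instantiation concrete, here is a minimal, self-contained sketch. `SlowClient` is a hypothetical stand-in for `language.Client` (it is not part of the Google Cloud library), and its simulated setup delay and dummy score are assumptions chosen only for illustration.

```python
import time

class SlowClient:
    """Hypothetical stand-in for an API client that is costly to construct."""
    instances = 0  # counts how many clients have been built

    def __init__(self):
        SlowClient.instances += 1
        time.sleep(0.01)  # simulated setup cost (assumed, not measured)

    def score(self, text):
        # Dummy "sentiment": just the length of the text
        return float(len(text))

def score_with_new_client(text):
    # Builds a fresh client on every call, as in the question
    return SlowClient().score(text)

shared_client = SlowClient()  # built once, reused by every call

def score_with_shared_client(text):
    return shared_client.score(text)

subjects = ["wow", "cool", "hey", "there", "come on", "welcome"]

# Count the clients built while scoring with each version
SlowClient.instances = 0
per_call = [score_with_new_client(s) for s in subjects]
built_per_call = SlowClient.instances

SlowClient.instances = 0
shared = [score_with_shared_client(s) for s in subjects]
built_shared = SlowClient.instances

# 6 new clients are built by the first version, 0 by the second
print(built_per_call, built_shared)
```

Both versions return the same scores; only the number of client constructions differs, which is where the time goes when setup is expensive.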

If you insist, you can use multiprocessing like this:

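from multiprocessing import Pool
# <Define functions and datasets here>
if __name__ == '__main__':  # guard needed where worker processes are spawned
    # 8 worker processes, or some number of your choice
    with Pool(processes=8) as pool:
        # map returns results in input order, so they line up with the rows
        df['score'] = pool.map(find_score, df['subject'])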
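
Since each call probably spends most of its time waiting on a network request, a thread pool may be a lighter alternative to separate processes. This is a different technique from the one shown above, sketched here with `concurrent.futures.ThreadPoolExecutor` from the standard library. `fake_find_score` is a hypothetical stand-in for `find_score`, the sleep is an assumed latency, and a plain list stands in for `df['subject']`. Whether one client can safely be shared across threads depends on the client library, so check that before applying this to the real function.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_find_score(text):
    """Hypothetical stand-in for find_score: waits briefly, returns a dummy score."""
    time.sleep(0.05)  # simulated network latency (assumed)
    return float(len(text))

# Plain list standing in for df['subject']
subjects = ["wow", "cool", "hey", "there", "come on", "welcome"]

# executor.map returns results in input order, so they line up with the rows
with ThreadPoolExecutor(max_workers=8) as executor:
    scores = list(executor.map(fake_find_score, subjects))

print(scores)
```

With a real dataframe, the resulting list could be assigned back with df['score'] = scores, just like the pool.map result above.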