Fast text processing in a Python DataFrame

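I am working on e-commerce data in Python. I have loaded the data into Python and converted it into a pandas DataFrame. Now I want to perform text processing on the data, such as removing unwanted characters, removing stopwords, and stemming. The code I am currently using works fine, but it takes a very long time. I have about 2 million rows to process, and it takes forever. I tried the code on 10,000 rows and it took about 240 seconds. This is my first time working on this kind of project. Any help reducing the processing time would be appreciated.

Thanks in advance.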

from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords 
import re 

def textprocessing(text): 
    stemmer = PorterStemmer() 
    # Remove unwanted characters 
    re_sp= re.sub(r'\s*(?:([^a-zA-Z0-9._\s "])|\b(?:[a-z])\b)'," ",text.lower()) 
    # Remove single characters 
    no_char = ' '.join([w for w in re_sp.split() if len(w)>1]).strip() 
    # Removing Stopwords 
    filtered_sp = [w for w in no_char.split(" ") if not w in stopwords.words('english')] 
    # Perform Stemming 
    stemmed_sp = [stemmer.stem(item) for item in filtered_sp] 
    # Converting it to string 
    stemmed_sp = ' '.join([x for x in stemmed_sp]) 
    return stemmed_sp

I call this method on the DataFrame like this:

files['description'] = files.loc[:,'description'].apply(lambda x: textprocessing(str(x)))

You can use any data you like, at your convenience. Due to some policies, I cannot share the data.

Source

2017-10-13 Sam

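One quick change that might help: it looks like the stopwords are ordinarily a list, with 2400 entries. Making it a set should considerably speed up the `if not w in stopwords` test. Try the change on a smaller extract first. Also, `apply` sometimes seems to be slower than a normal list comprehension. It might be worth extracting the column, running your code (which is actually a good bit of processing) as a list comprehension, and then re-inserting the result.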
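To illustrate the set suggestion, here is a minimal sketch that compares the two membership tests. It deliberately avoids NLTK so that it runs without downloading the corpus: `STOPWORD_LIST`, `STOPWORD_SET`, `tokens`, and the two filter functions are hypothetical stand-ins, and the list is simply sized to the 2400 entries mentioned in the comment. Note also that the question's code evaluates `stopwords.words('english')` inside the comprehension's condition, so that call is repeated for every token; building the collection once, as below, avoids that as well.

```python
import timeit

# Hypothetical stand-in for stopwords.words('english'), sized to the 2400
# entries mentioned in the comment. This is NOT the real NLTK list.
STOPWORD_LIST = [f"stop{i}" for i in range(2400)]
STOPWORD_SET = set(STOPWORD_LIST)  # built once, reused for every row

# Made-up tokens: a mix of stand-in stopwords and ordinary words.
tokens = ["stop2399", "wireless", "stop17", "mouse", "battery"] * 200


def filter_with_list(words):
    # Each membership test scans the list element by element.
    return [w for w in words if w not in STOPWORD_LIST]


def filter_with_set(words):
    # Each membership test is a single hash lookup.
    return [w for w in words if w not in STOPWORD_SET]


if __name__ == "__main__":
    # Timings vary by machine; both functions return the same result.
    print("list:", timeit.timeit(lambda: filter_with_list(tokens), number=5))
    print("set: ", timeit.timeit(lambda: filter_with_set(tokens), number=5))
```

With the real data, the equivalent change would be something like `STOP_WORDS = set(stopwords.words('english'))` defined once outside the function.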

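In my experience, pandas `apply` is much slower than applying a function over other structures such as lists or dictionaries. Is there a specific reason you want the data in a `pandas.DataFrame`? Have you considered using another structure?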

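I am loading it from a database. That is why I converted it to a DataFrame for processing. Are there other data storage options that I can easily apply and work with? – Sam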
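One option that keeps the DataFrame is the approach from the first comment: pull the column out as a plain Python list, process it with a list comprehension, and assign the result back. The sketch below assumes pandas is installed; the `clean` function, the `STOP_WORDS` set, and the sample rows are hypothetical, and stemming is left out so the example runs without NLTK. Whether this beats `apply` on the real data is something to measure on a small extract first.

```python
import re

import pandas as pd

# Hypothetical stand-in stopword set; the real one would come from NLTK.
STOP_WORDS = {"the", "is", "a", "with", "and", "for"}

# Compiled once; replaces anything that is not a lowercase letter, digit,
# dot, underscore, or whitespace.
NON_WORD = re.compile(r"[^a-z0-9._\s]")


def clean(text):
    # Lowercase, strip unwanted characters, drop single characters and stopwords.
    text = NON_WORD.sub(" ", text.lower())
    return " ".join(t for t in text.split() if len(t) > 1 and t not in STOP_WORDS)


# Made-up sample data standing in for the real `files` DataFrame.
files = pd.DataFrame({"description": ["The mouse is wireless!", "A case for the phone"]})

# Extract the column, process it as a list comprehension, then re-insert it.
descriptions = files["description"].tolist()
files["description"] = [clean(str(d)) for d in descriptions]

print(files)
```

The stemmer call would go inside `clean`, applied to each token that survives the filter.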

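You could try doing everything in a single pass, without creating the stemmer and the stopword list on every call:

import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

STEMMER = PorterStemmer()  # created once, not on every call
STOP_WORD = stopwords.words('english')  # loaded once, not once per token
def textprocessing(text):
    cleaned = re.sub(r'\s*(?:([^a-zA-Z0-9._\s "])|\b(?:[a-z])\b)', " ", text.lower())
    return ' '.join(STEMMER.stem(token) for token in cleaned.split() if token not in STOP_WORD and len(token) > 1)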

You could also use NLTK to remove the unwanted characters:

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
STEMMER = PorterStemmer()
STOP_WORD = stopwords.words('english')
TOKENIZER = RegexpTokenizer(r'\w+')  # keeps only runs of word characters
def textprocessing(text):
    return ' '.join(STEMMER.stem(token) for token in TOKENIZER.tokenize(text.lower()) if token not in STOP_WORD and len(token) > 1)

Source

2017-10-13 14:30:07 galaxyan

Thanks!!! It really did speed things up several times over, with only a few fixes needed in the code. – Sam
