Python: Preprocessing text

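I am trying to preprocess a string with a lemmatizer and then remove punctuation and digits. I am using the code below to do this. I don't get any errors, but the text is not preprocessed properly: only the stop words are removed, the lemmatization has no effect, and the punctuation and digits remain.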

from nltk.stem import WordNetLemmatizer 
import string 
import nltk 
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34." 
lemmatizer = WordNetLemmatizer() 
tweets = lemmatizer.lemmatize(tweets) 
data=[] 
stop_words = set(nltk.corpus.stopwords.words('english')) 
words = nltk.word_tokenize(tweets) 
words = [i for i in words if i not in stop_words] 
data.append(' '.join(words)) 
corpus = " ".join(str(x) for x in data) 
p = string.punctuation 
d = string.digits 
table = str.maketrans(p, len(p) * " ") 
corpus.translate(table) 
table = str.maketrans(d, len(d) * " ") 
corpus.translate(table) 
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

And the expected output should be:

This beautiful day I work exercise text

2017-10-16 Alex

I would use a regular expression to strip out the noise before calling the lemmatizer.

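Thanks for the suggestion. But shouldn't the code above work the way I expect? I have used the same code before and it worked fine, so I don't know why it isn't working this time. – Alex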

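No, your current approach won't work, because you have to pass one word at a time to the lemmatizer/stemmer. These functions expect a single word, so they do not interpret your string as a sentence.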

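import re
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    # Lowercase, then strip punctuation and digits in one pass
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize each remaining word as a verb, skipping stop words
    return ' '.join([lemmatizer.lemmatize(i, 'v') for i in cleaned_tweet.split() if i not in __stop_words])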
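Separately from the lemmatizer issue, the punctuation and digits survive in the question's code because Python strings are immutable: `corpus.translate(table)` returns a new string, and that result is never assigned. A minimal standard-library sketch of that fix, using the question's input (the names `chars` and `cleaned` are my own):

```python
import string

corpus = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."

# One table that maps every punctuation mark and digit to a space.
chars = string.punctuation + string.digits
table = str.maketrans(chars, " " * len(chars))

# str.translate returns a new string, so the result has to be kept.
cleaned = corpus.translate(table)

# Collapse the runs of spaces left behind by the replaced characters.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

Stop-word removal and per-word lemmatization would still need to be applied to `cleaned` afterwards.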

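Alternatively, you can use a PorterStemmer, which does much the same job as lemmatisation but without context (it strips suffixes by rule, so the result is not always a dictionary word).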

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

And call the stemmer like this:

stemmer.stem(i)
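To make the word-by-word point concrete, here is a small sketch that stems each token separately. It assumes NLTK is installed; the helper name `stem_words` and the plain `split()` tokenization are illustrative choices of mine, not part of the original answer.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    # Hypothetical helper: stem each whitespace-separated token on its own.
    return [stemmer.stem(word) for word in text.split()]

print(stem_words("working on an exercise"))
```

Because the stemmer only applies suffix rules, some of the returned stems may not be real words.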

2017-10-16 21:43:59

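Hey, could you also tell me how to preprocess the text if it is a DataFrame column? I want to remove all punctuation and digits, lemmatize the text, and remove stop words from every row of one DataFrame column. – Alex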

@Ritika Define a function, then pass it to df.apply ...

Thanks a lot :) – Alex
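The df.apply suggestion could look roughly like the sketch below. It assumes pandas is installed; the column name `text`, the sample rows, and the tiny hard-coded `STOP_WORDS` set are placeholders of mine. In practice you would use NLTK's stop-word list and add the per-word lemmatizer call inside the function.

```python
import re
import pandas as pd

# Placeholder stop-word set; swap in nltk.corpus.stopwords.words('english').
STOP_WORDS = {"is", "a", "am", "on", "an", "i", "this"}

def clean(text):
    # Remove everything except letters and whitespace, then lowercase.
    text = re.sub(r"[^A-Za-z\s]", "", text).lower()
    # Drop stop words; a per-word lemmatizer call would also go here.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

df = pd.DataFrame({"text": ["This is a beautiful day16~.",
                            "I am; working on an exercise45."]})

# Series.apply runs clean() once per row of the column.
df["clean_text"] = df["text"].apply(clean)
print(df["clean_text"].tolist())
```

Applying the function to a single column with `df["text"].apply(clean)` keeps the original text alongside the cleaned version, which makes it easy to spot-check rows.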

I think this is what you're looking for, but do it before calling the lemmatizer, as the commenter noted.

>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> print(s)
This is a beautiful day I am working on an exercise text

2017-10-16 21:37:15
