2017-08-14

I am new to text processing in Python, and I am trying to stem the words in a text file of about 5,000 lines, using NLTK's stemmer (Python).

I wrote the following script:

import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()

    # 4. Remove stop words
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    # 5. Stem words
    words = ([stemmer.stem(w) for w in words])

    # 6. Join the words back into one string separated by space,
    # and return the result.
    return(" ".join(meaningful_words))

clean_Description = Description_to_words(train["Description"][15]) 

But when I test it, the words in the result are not stemmed. Can anyone help me see what is wrong in my Description_to_words function?

Also, when I run the stemming command on its own, as below, it works:

>>> from nltk.tokenize import sent_tokenize, word_tokenize 
>>> words = word_tokenize("MOBILE APP - Unable to add reading") 
>>> 
>>> for w in words: 
...  print(stemmer.stem(w)) 
... 
mobil 
app 
- 
unabl 
to 
add 
read 

Answers

1

Here is each step of your function, fixed.

  1. Remove HTML.

    Description_text = BeautifulSoup(raw_Description).get_text() 
    
  2. Remove non-letters, but do not remove the spaces just yet. You can also simplify your regex.

    letters_only = re.sub(r"[^\w\s]", " ", Description_text) 
    
  3. Convert to lower case and split into individual words: I recommend using word_tokenize again here.

    from nltk.tokenize import word_tokenize 
    words = word_tokenize(letters_only.lower())     
    
  4. Remove stop words.

    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if not w in stops] 
    
  5. Stem. Here is the other problem: stem meaningful_words, not words.

    return ' '.join(stemmer.stem(w) for w in meaningful_words) 
    
+0

That was simple. Thank you very much for your reply. It works. I am so happy :) – user3734568

+0

Just one question: can we use the same logic for lemmatizing words with .lemmatize()? – user3734568

+1

@user3734568 Yes, you can; just change 'stemmer.stem(w)' to 'lemmatizer.lemmatize(w)' –