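Removing stop words from a text without removing duplicates of regular words

I am trying to create a list of the 50 most common words in a particular text file, but I want to eliminate the stop words from that list. I did that using this code: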

import nltk
from nltk import FreqDist
from nltk.corpus import gutenberg, stopwords
carroll = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
carroll_list = FreqDist(carroll)
stops = set(stopwords.words("english"))
filtered_words = [word for word in carroll_list if word not in stops]

However, this removes the duplicates of the words I want to keep. For example, when I do this:

fdist = FreqDist(filtered_words) 
fdist.most_common(50)

I get this output:

[('right', 1), ('certain', 1), ('delighted', 1), ('adding', 1), 
('work', 1),  ('young', 1), ('Up', 1), ('soon', 1), ('use', 1),  
('submitted', 1), ('remedies', 1), ('tis', 1), ('uncomfortable', 1)....]

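Each word appears exactly once, so the duplicates have clearly been eliminated. I want to keep the duplicates so that I can see which words are most common. Any help would be greatly appreciated.

Source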

2016-09-21 Cody

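Please post a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). Without the original list and the other supporting items, we cannot reproduce your problem. It looks like you only have the filtered words once each, rather than the full frequencies from the original text. – Prune

As you have written it now, list is already a distribution with the words as keys and the number of occurrences as values: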

>>> list 
FreqDist({u',': 1993, u"'": 1731, u'the': 1527, u'and': 802, u'.': 764, u'to': 725, u'a': 615, u'I': 543, u'it': 527, u'she': 509, ...})

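You then iterate over its keys, which means each word shows up only once. I believe you actually want to create filtered_words like this, iterating over the original text instead: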

filtered_words = [word for word in carroll if word not in stops]

Also, you should try to avoid variable names that shadow Python built-ins (list is a Python built-in).

Source
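To see the difference without downloading any corpus, here is a minimal sketch that uses collections.Counter (FreqDist is a subclass of it). The tokens list and the stops set are made-up stand-ins for the Gutenberg text and the NLTK stop-word list, not the real data.

```python
from collections import Counter

# Hypothetical stand-ins for the tokenized text and the stop-word set.
tokens = ["the", "cat", "and", "the", "hat", "and", "the", "cat"]
stops = {"the", "and"}

counts = Counter(tokens)  # maps each distinct word to its count

# Iterating over the counter yields each distinct word once,
# so every surviving word ends up with a count of 1.
from_keys = Counter(word for word in counts if word not in stops)

# Iterating over the original tokens keeps the repeats.
from_tokens = Counter(word for word in tokens if word not in stops)

print(from_keys.most_common())    # [('cat', 1), ('hat', 1)]
print(from_tokens.most_common())  # [('cat', 2), ('hat', 1)]
```

The same reasoning should carry over to the question's code: filter the token sequence (carroll), then build the frequency distribution from the filtered tokens.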

2016-09-21 22:13:31 FamousJameous
