nltk stopword removal gives wrong output

I'm having trouble removing stopwords. When I run my script:

import nltk 
from nltk.corpus import stopwords 
file1=open('english.txt', 'r') 
english=file1.read() 
file1.close() 
english_corpus_lowercase =([w.lower() for w in english]) 
english_without_punc=''.join([c for c in english_corpus_lowercase if c not in (",", "``", "`", "?", ".", ";", ":", "!", "''", "'", '"', "-", "(", ")")]) 
print(english_without_punc) 
print(type(english_without_punc)) 
stopwords = nltk.corpus.stopwords.words('english') 
print(stopwords) 
english_corpus_sans_stopwords = set() 
for w in english_without_punc: 
    if w not in stopwords: 
     english_corpus_sans_stopwords.add(w) 
     print(english_corpus_sans_stopwords)

It gives me the output below. How do I fix it?

{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'} 
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '「', 'g', 'u', 'p', 'c'}


2017-08-11 Miss Alena

Your 'english_corpus_lowercase' is not a list of words but a string. You must tokenize it first. – DyZ
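
To see why the output is a set of single letters, note that iterating over a Python string yields characters. The sketch below uses a made-up sample sentence in place of english.txt, and `str.split()` as a crude stand-in for a real tokenizer; neither comes from the original post.

```python
# Hypothetical sample text standing in for the contents of english.txt.
english = "The cat sat on the mat"

# Iterating over a string yields characters, so this is a list of letters.
chars = [w.lower() for w in english]

# Splitting first yields words; str.split() is only a rough stand-in
# for a proper tokenizer such as nltk's word_tokenize.
words = [w.lower() for w in english.split()]

print(chars[:3])
print(words[:3])
```

Once the items are single characters, comparing them against a stopword list can only ever remove one-letter stopwords, which is why letters such as 'b' and 'n' survive.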

As a side note, since "''" and the like are not single-character strings, they will never be removed from your text. – DyZ
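
A small sketch of that side note, assuming a made-up sample string and a hand-written token list (neither is from the original post). It compares filtering character by character with filtering token by token:

```python
# Hypothetical sample text using two-character quote marks.
text = "``hi'', she said"
punct = (",", "``", "''")

# Character-level filter: each item tested is one character, so the
# two-character entries "``" and "''" can never match.
by_char = "".join(c for c in text if c not in punct)

# Token-level filter: here "``" and "''" are whole tokens and do match.
tokens = ["``", "hi", "''", ",", "she", "said"]
by_token = [t for t in tokens if t not in punct]

print(by_char)
print(by_token)
```

Only the comma is removed in the character-level version; the quote marks go away only once the text has been split into tokens.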

Try the following:

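import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Requires the NLTK tokenizer and stopwords data (see nltk.download).
file1 = open('english.txt', 'r')
english = file1.read()
file1.close()

# Tokenize into words first, then lowercase each token.
english_corpus_lowercase = [w.lower() for w in word_tokenize(english)]
# Drop tokens that are punctuation marks.
english_without_punc = [c for c in english_corpus_lowercase if c not in (",", "``", "`", "?", ".", ";", ":", "!", "''", "'", '"', "-", "(", ")")]
english_corpus_sans_stopwords = []
# Use a separate name so the imported stopwords module is not shadowed.
stop_words = stopwords.words('english')

# Keep every token that is not a stopword, preserving order.
for w in english_without_punc:
    if w not in stop_words:
        english_corpus_sans_stopwords.append(w)
print(english_corpus_sans_stopwords)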
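
As a possible variation on the loop above, the stopword list can be turned into a set so that each membership test is a hash lookup rather than a list scan. The function name `remove_stopwords` and the tiny stopword list here are made up for illustration; in the real script the input would come from `stopwords.words('english')`.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens not found in stop_words, keeping order and duplicates."""
    stop_set = set(stop_words)  # set membership is a hash lookup
    return [t for t in tokens if t not in stop_set]


# Tiny hand-picked stand-in for nltk's English stopword list.
sample_stop_words = ["the", "on", "a"]
sample_tokens = ["the", "cat", "sat", "on", "the", "mat"]

result = remove_stopwords(sample_tokens, sample_stop_words)
print(result)
```

Unlike the `set()` used in the question, a list result keeps the words in their original order and keeps repeats.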


2017-08-11 23:01:50 Andras

Thank you very much! It works flawlessly )) –

You're welcome. The trick is to use 'word_tokenize', which does the heavy lifting for you :) – Andras
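
For a rough feel of what a tokenizer does, here is a sketch that uses only the standard library's `re` module. It is not how `word_tokenize` works internally, and the helper name `simple_tokenize` is made up; it only shows punctuation being split off as separate tokens so that it can be filtered out afterwards.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, world! It works.")
print(tokens)
```

A real tokenizer handles many more cases (contractions, quotes, abbreviations), which is the heavy lifting mentioned above.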
