2011-03-30 70 views
73

How do I remove stop words using NLTK or Python? I have a dataset from which I would like to remove stop words using

stopwords.words('english') 

I'm struggling with how to use this within my code to simply take those words out of the dataset. I already have a list of the words from this dataset; the part I'm struggling with is comparing against that list and removing the stop words. Any help is appreciated.

+4

Where are you getting the stop words from? Is that from NLTK? – 2014-04-07 22:15:14

+25

@MattO'Brien 'from nltk.corpus import stopwords' for future googlers – danodonovan 2015-05-13 21:11:43

+11

To make the stopword dictionary available, you also need to run 'nltk.download("stopwords")'. – sffc 2015-07-10 17:12:03
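For future readers, a minimal setup sketch assembled from the comments above (the corpus download only has to happen once per machine):

import nltk
nltk.download('stopwords')              # one-time download of the stopword corpus

from nltk.corpus import stopwords
print(stopwords.words('english')[:10])  # peek at the first few English stop words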

Answers

14

I suppose you have a list of words (word_list) from which you want to remove the stop words. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove the word if it is a stopword
+3

This will be much slower than Daren Thomas's list comprehension... – drevicko 2016-08-26 10:54:01

147
from nltk.corpus import stopwords 
# ... 
filtered_words = [word for word in word_list if word not in stopwords.words('english')] 
+0

Thanks for both answers; they both work, although it seems I have a flaw in my code preventing the stop list from working properly. Should that be a new question? Not sure how things work around here yet! – Alex 2011-03-30 14:29:58

+29

For better performance, consider 'stops = set(stopwords.words("english"))' instead. – isakkarlsson 2013-09-07 22:04:31
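A minimal sketch of that suggestion applied to the list comprehension above; building the set once avoids re-reading the corpus for every word:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))   # build the lookup set once
filtered_words = [word for word in word_list if word not in stops]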

+1

>>> import nltk >>> nltk.download() [Source](http://www.nltk.org/data.html) – 2017-12-14 20:33:51

19

You can also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english'))) 
+6

Note: this converts the sentence to a SET to remove all duplicate words, so you won't be able to use frequency counting on the result – 2017-02-21 23:59:40
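To illustrate that caveat, a rough sketch (the whitespace pattern r'\s+' with gaps=True is just an assumed tokenizer, not part of the original answer); the Counter variant keeps duplicates if frequency counts matter:

import nltk
from collections import Counter

tokens = nltk.regexp_tokenize("the cat saw the cat", r'\s+', gaps=True)   # ['the', 'cat', 'saw', 'the', 'cat']
stops = set(nltk.corpus.stopwords.words('english'))

print(list(set(tokens) - stops))                     # e.g. ['saw', 'cat'] -- duplicates and order are lost
print(Counter(w for w in tokens if w not in stops))  # Counter({'cat': 2, 'saw': 1}) keeps the counts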

0
# 1) build a new list containing only the non-stop words
print("enter the string from which you want to remove stop words")
userstring = input().split(" ")
stop_words = ["a", "an", "the", "in"]
another_list = []
for x in userstring:
    if x not in stop_words:       # keep the word only if it is not a stop word
        another_list.append(x)
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove instead
print("enter the string from which you want to remove stop words")
userstring = input().split(" ")
stop_words = ["a", "an", "the", "in"]
for x in userstring[:]:           # iterate over a copy so .remove does not skip words
    if x in stop_words:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
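A hypothetical sample run of either version above:

enter the string from which you want to remove stop words
the cat sat in a hat
cat sat hat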
0

You can use this function; note that you need to lowercase all the words:

from nltk.corpus import stopwords 

def remove_stopwords(word_list): 
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower cased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
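A quick hypothetical usage example of the function above:

words = ["This", "is", "a", "Sample", "sentence"]
print(remove_stopwords(words))   # ['sample', 'sentence']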
1

Use filter:

from nltk.corpus import stopwords 
# ... 
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list)) 
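For example, with a hypothetical word_list and the standard NLTK English stop word list:

word_list = ["the", "quick", "brown", "fox", "is", "here"]
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
print(filtered_words)   # ['quick', 'brown', 'fox']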
4

To exclude all kinds of stop words, including the NLTK stop words, you could do something like this:

from many_stop_words import get_stop_words 
from nltk.corpus import stopwords 

stop_words = list(get_stop_words('en'))   #About 900 stopwords 
nltk_words = list(stopwords.words('english')) #About 150 stopwords 
stop_words.extend(nltk_words) 

output = [w for w in word_list if w not in stop_words]
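Since the two lists overlap, one optional refinement (an assumption, not part of the original answer) is to deduplicate them into a set before filtering, which also speeds up the membership test:

stop_words = set(stop_words)   # deduplicate and make lookups O(1)
output = [w for w in word_list if w not in stop_words]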