How to remove stop words from a text file without removing the spaces

-2

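I have to remove stop words from a text file containing 50K tweets. When I run this code, it removes the stop words successfully, but it also removes the spaces. I want to keep the spaces in the text. How can I remove stop words from a text file without removing the spaces?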

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import codecs

import nltk

stopset = set(stopwords.words('english'))

writeFile = codecs.open("outputfile", "w", encoding='utf-8')

with codecs.open("inputfile", "r", encoding='utf-8') as f:
    line = f.read()
    tokens = nltk.word_tokenize(line)
    tokens = [w for w in tokens if w not in stopset]
    for token in tokens:
        writeFile.write(token)

Source

2015-02-11 ALphaCS

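When you write the output, write whitespace wherever you want whitespace. In this specific case, a newline after each token looks appropriate, since tokenizing already discards all other formatting. Using print instead of write does this without requiring an explicit newline after each token: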

from __future__ import print_function # if you're on Python 2 
# ... 
for token in tokens: 
    print(token, file=writeFile)

Alternatively, if you want spaces rather than newlines, write spaces. If you have a limited number of tokens, you can simply use

print(' '.join(tokens), file=writeFile)

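but this uses extra memory, because the whole string is joined together before it is printed, so looping over the tokens is more economical. If you process the input one line at a time, though, joining is probably good enough, and it keeps the tokens from one input line together on one output line. (Note that the posted code reads the entire file at once with f.read(), so there the join would build a single string for the whole file.)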
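
As a rough sketch of the line-at-a-time approach described above: the helper below reads one line, drops the stop words, and writes the remaining tokens joined by single spaces. The names `remove_stopwords` and `DEMO_STOPWORDS` are hypothetical, and the plain `str.split` tokenizer and three-word stop list are stand-ins so that the example runs without NLTK data. In the real script you would presumably pass `stopset` and `nltk.word_tokenize` instead.

```python
import io

# Hypothetical stand-in for set(stopwords.words('english')).
DEMO_STOPWORDS = {"the", "is", "a"}


def remove_stopwords(infile, outfile, stopset, tokenize=str.split):
    """Copy infile to outfile line by line, dropping tokens found in stopset."""
    for line in infile:
        tokens = [w for w in tokenize(line) if w not in stopset]
        # Join with spaces so the words stay separated;
        # each input line produces exactly one output line.
        outfile.write(' '.join(tokens) + '\n')


# Small in-memory demonstration; real code would open files instead.
src = io.StringIO("the cat is here\nthis is a test\n")
dst = io.StringIO()
remove_stopwords(src, dst, DEMO_STOPWORDS)
result = dst.getvalue()
```

Because only one line is held in memory at a time, the size of the join is bounded by the longest line rather than by the whole file, which should matter for an input of 50K tweets.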

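If you have a large number of tokens per line and want to loop over them in a memory-efficient way, a common idiom is to declare a separator that is initially empty: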

sep = '' 
for token in tokens: 
    writeFile.write('{}{}'.format(sep, token)) # str.format(): py >= 2.6 
    sep=' ' 
writeFile.write('\n')
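
The same idiom can be wrapped in a small helper, shown here only as an illustrative sketch. The function name `write_separated` is hypothetical, and `io.StringIO` stands in for the output file so that the result can be inspected.

```python
import io


def write_separated(outfile, tokens, sep=' '):
    """Write tokens to outfile with sep between them, one token at a time."""
    current = ''  # nothing is written before the first token
    for token in tokens:
        outfile.write('{}{}'.format(current, token))
        current = sep  # every later token is preceded by the separator
    outfile.write('\n')


buf = io.StringIO()
# A generator works too, since the tokens are consumed one at a time.
write_separated(buf, (w for w in ['good', 'morning', 'twitter']))
line_out = buf.getvalue()
```

Since the separator is written before each token except the first, the line has no leading or trailing space, and the tokens never need to be collected into one large string.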

Source

2015-02-11 04:13:47 tripleee

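Then you will end up with one really long line, but more power to you. – tripleee 2015-02-11 04:44:57

Put spaces between the words. – tripleee 2015-02-11 04:51:04

That is not feasible, because this file has more than 50,000 lines. – ALphaCS 2015-02-11 04:52:16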
