2016-09-16

UnicodeDecodeError when reading a custom-created corpus in NLTK

I created a custom corpus for detecting sentence polarity using the nltk module. Here is the corpus hierarchy:

polarity
--polar
---- polar_tweets.txt
--nonpolar
---- nonpolar_tweets.txt

Here is how I load the corpus in my source code:

from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

polarity = LazyCorpusLoader('polar', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(polar|nonpolar)/.*', encoding='utf-8')
corpus = polarity
print(corpus.words(fileids=['nonpolar/nonpolar_tweets.txt']))

but it raises the following error:

Traceback (most recent call last): 
    File "E:/Analytics Practice/Social Media Analytics/analyticsPlatform/DataAnalysis/SentimentAnalysis/data/training_testing_data.py", line 9, in <module> 
    print(corpus.words(fileids=['nonpolar/nonpolar_tweets.txt'])) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\util.py", line 765, in __repr__ 
    for elt in self: 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from 
    tokens = self.read_block(self._stream) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block 
    words.extend(self._word_tokenizer.tokenize(stream.readline())) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1135, in readline 
    new_chars = self._read(readsize) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1367, in _read 
    chars, bytes_decoded = self._incr_decode(bytes) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1398, in _incr_decode 
    return self.decode(bytes, 'strict') 
    File "C:\Users\prabhjot.rai\AppData\Local\Continuum\Anaconda3\lib\encodings\utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 269: invalid continuation byte 

While creating the files polar_tweets.txt and nonpolar_tweets.txt, I decoded the file uncleaned_polar_tweets.txt as utf-8 and then wrote the result to polar_tweets.txt. Here is that code:

with open(path_to_file, "rb") as file: 
    output_corpus = clean_text(file.read().decode('utf-8'))['cleaned_corpus'] 

output_file = open(output_path, "w") 
output_file.write(output_corpus) 
output_file.close() 

where output_file is polar_tweets.txt or nonpolar_tweets.txt. Where is the error? I start out with utf-8 encoding, and then also read it back as utf-8 via the line

polarity = LazyCorpusLoader('polar', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(polar|nonpolar)/.*', encoding='utf-8') 

If I replace encoding='utf-8' with encoding='latin-1', the code works perfectly. Where is the problem? Do I also need to decode as utf-8 while creating the corpus?
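As an aside, the reason encoding='latin-1' appears to "work" is that latin-1 maps every possible byte value to some character, so decoding can never fail; when the bytes are not actually latin-1 it silently produces wrong characters instead of raising. A minimal sketch:

```python
# utf-8 bytes decoded as latin-1 never raise an error,
# but they come out as mojibake rather than the original text.
data = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
print(data.decode("latin-1"))        # 'cafÃ©' – decodes "successfully", but wrong
```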


Your terminology is off. When reading, you *decode*. The error indicates that the corpus (or some part of it) is not valid UTF-8. Without access to the offending data we can only speculate. What does `LC_ALL=C grep -m 1 $'\xC2' nonpolar_tweets.txt` produce? (Maybe pipe it to `xxd` or similar to examine the exact bytes.) – tripleee


...or the equivalent in Python – read the offending line and then inspect its `repr()` – tripleee
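The Python equivalent the comment suggests can be sketched as a small helper that reports where the first invalid byte sits and shows the surrounding bytes (the corpus path below is taken from the question; substitute your own file):

```python
def find_invalid_utf8(raw: bytes, context: int = 20):
    """Return (offset, surrounding bytes) of the first byte that is not
    valid UTF-8, or None if the whole input decodes cleanly."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as e:
        # e.start is the offset of the offending byte, as in the traceback.
        return e.start, raw[max(e.start - context, 0):e.start + context]

# Usage against the corpus file from the question:
# with open("nonpolar/nonpolar_tweets.txt", "rb") as f:
#     print(find_invalid_utf8(f.read()))
```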

Answer

1

What you need to understand is that in Python's model, Unicode is a kind of data, while utf-8 is an encoding. They are not the same thing. You read your raw text, which apparently is in utf-8; clean it; and then write it to the new corpus without specifying an encoding. So you write it out in... who knows what encoding. Don't bother finding out – just clean the data again and regenerate the corpus, this time specifying the utf-8 encoding when you write.
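The "who knows what encoding" point can be made concrete: text-mode open() without an encoding argument falls back to the locale's preferred encoding, which on many Windows systems is cp1252 rather than utf-8. A sketch (the actual default depends on your platform):

```python
import locale

# What open(path, "w") uses when no encoding= is given:
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on Windows

# Bytes written under cp1252 then read back as utf-8 fail exactly like
# the traceback above:
data = "caf\u00e9".encode("cp1252")  # b'caf\xe9'
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # invalid byte reported with its offset
```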

I hope you did all of this in Python 3; if not, stop right here and switch to Python 3. Then write out the corpus like this:

output_file = open(output_path, "w", encoding="utf-8") 
output_file.write(output_corpus) 
output_file.close() 

Thanks for the clarification :) –