String handling error: UnicodeDecodeError: 'utf8' codec can't decode

I'm trying to analyze the frequency of a list of passwords. My script works on other input files, but it looks like my current data set contains some bad characters. How can I get around the "bad" data?

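import re
import collections

# open() is called without an explicit encoding, so the file is decoded
# with the platform default (UTF-8 here, per the traceback below).
words = re.findall(r'\w+', open('rockyou.txt').read().lower())
top_words = collections.Counter(words).most_common(50)
for word in top_words:
    print(word)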

Then I get this error:

Traceback (most recent call last): 
    File "shakecount.py", line 3, in <module> 
    words = re.findall('\w+', open('rockyou.txt').read().lower().ASCII) 
    File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/codecs.py", line 300, in decode 
    (result, consumed) = self._buffer_decode(data, self.errors, final) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 5079963: invalid continuation byte

Any ideas?

Source

2012-04-11 AlphaTested

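Your code doesn't exactly match your error (I assume you were trying to debug?), but the underlying problem is that your text file isn't UTF-8.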

You need to specify the encoding manually. My best guess is latin-1:

words = re.findall(r'\w+', open('rockyou.txt', encoding='latin-1').read().lower())

Or, if you want to continue despite the errors, you can pass errors='ignore' or errors='replace' to open.
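To see what each option does with a byte like the 0xf1 from the traceback, here is a minimal sketch. The sample bytes are made up for illustration (they are not taken from rockyou.txt). bytes.decode accepts the same encoding and errors arguments as open, so the results should carry over.

```python
# Hypothetical sample: "mañana" stored as latin-1, so ñ is the single byte 0xf1.
raw = b"ma\xf1ana"

# Strict UTF-8 decoding fails, as in the traceback.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# latin-1 maps every possible byte to a character, so it never raises.
# If the file is really in some other encoding, the characters will be wrong.
as_latin1 = raw.decode("latin-1")

# errors='ignore' silently drops the undecodable byte;
# errors='replace' substitutes U+FFFD (the replacement character).
ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

print(repr(as_latin1), repr(ignored), repr(replaced))
```

For counting passwords, errors='ignore' changes the affected words (here "mañana" becomes "maana"), so specifying the right encoding is usually the safer choice when it is known.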

Source

2012-04-11 21:31:55 agf

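The above was helpful but didn't ultimately solve the problem; I ran into more errors that were Greek to me (I'm new to programming). I ended up opening the word list in a text editor and re-saving it as UTF-8, and then it ran. Thanks for the help, agf! – AlphaTested 2012-04-12 07:01:07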
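The re-saving step can also be done in Python rather than a text editor. This is only a sketch: reencode_to_utf8 is a hypothetical helper name, and latin-1 as the source encoding is the guess from the answer above, not something confirmed for this file.

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Read src_path as src_encoding and write the same text to dst_path as UTF-8.

    src_encoding defaults to latin-1, which is an assumption; pass the
    real encoding if you know it.
    """
    # newline="" leaves line endings untouched on both read and write.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:  # stream line by line so large files stay out of memory
            dst.write(line)
```

After a call such as reencode_to_utf8('rockyou.txt', 'rockyou-utf8.txt') (the output name is made up), the original script should be able to read the new file with the default UTF-8 decoding.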

@AlphaTested If you don't know the encoding, another option is to use [chardet](http://pypi.python.org/pypi/chardet) to detect it. – agf 2012-04-12 07:04:00
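A rough sketch of that suggestion follows. guess_encoding is a hypothetical helper, not part of chardet. It calls chardet.detect when the third-party package is installed, and otherwise falls back to trying a short list of candidate encodings; that fallback is an addition for illustration, not something chardet does.

```python
def guess_encoding(raw, candidates=("utf-8", "latin-1")):
    """Return a best-guess encoding name for the bytes in raw, or None."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        chardet = None

    if chardet is not None:
        # detect() returns a dict; 'encoding' may be None if it cannot decide.
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess

    # Fallback (an assumption, not chardet behaviour): first candidate that decodes.
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


if __name__ == "__main__":
    # Read bytes, not text, so nothing is decoded before detection.
    with open("rockyou.txt", "rb") as f:
        print(guess_encoding(f.read()))
```

The result is a statistical guess, so it is worth spot-checking a few decoded lines. Note also that the failing byte in the traceback sits about 5 MB into the file, so detecting on only the first few kilobytes could miss it.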

Ah, I see. Thanks. – AlphaTested 2012-04-12 07:37:42
