How to read a UTF-8 encoded text file using Python

I need to analyse Tamil text files (UTF-8 encoded). I am using Python's nltk package in IDLE. When I try to read the text file in IDLE, this is the error I get. How can I avoid it?

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read() 

Traceback (most recent call last): 
    File "<pyshell#2>", line 1, in <module> 
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read() 
    File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode 
    return codecs.charmap_decode(input,self.errors,decoding_table)[0] 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

Source

2016-12-01 Ramprashanth

I haven't read your question fully, but... if you have a bunch of bytes, you can decode them into a string with 'your_bytes.decode("utf-8")'. – byxor
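A minimal sketch of what this comment describes, and of why the traceback mentions cp1252. The sample string is a hypothetical stand-in (the Tamil word for "Tamil", written with escapes), not content from the asker's file. Its UTF-8 encoding happens to contain the byte 0x8d, which cp1252 leaves undefined, so it fails in the same way as the traceback.

```python
# Hypothetical sample text: the Tamil word for "Tamil", written with escapes
# so the snippet does not depend on the editor's own encoding.
text = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# The bytes as they would be stored in a UTF-8 file.
raw = text.encode("utf-8")

# Decoding with the right codec gives the original string back.
decoded = raw.decode("utf-8")

# Decoding with cp1252 (the Windows default seen in the traceback) fails,
# because the byte 0x8d has no character assigned in that code page.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
```

So the file itself is fine; the problem is only which codec Python picks when none is given.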

Which Python version? –

@AntonisChristofides - from the traceback, I infer Python 3. –

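Since you are using Python 3, just pass the encoding parameter to open(). Without it, open() falls back to the system's default encoding, which on this Windows machine is cp1252 (as the traceback shows):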

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt', 
       encoding='utf-8').read()
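The same call can also be written with a with block, so that the file is closed as soon as it has been read. This is only a sketch: the temporary file and the sample string are hypothetical stand-ins for ettuthokai.txt, used so the snippet runs anywhere.

```python
import os
import tempfile

# Hypothetical sample text standing in for the real corpus.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Create a temporary UTF-8 file to play the role of ettuthokai.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(sample)

# Read it back with an explicit encoding; the with block closes the file.
with open(path, encoding="utf-8") as f:
    corpus = f.read()

# Clean up the temporary file.
os.remove(path)
```

With the real file, only the path passed to open() changes; corpus is then an ordinary str that can be handed to nltk.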

Source

2016-12-01 19:14:36

This works only on Python 3+. For Python 2, use 'codecs.open'. –
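A rough sketch of the 'codecs.open' approach mentioned in this comment, written so that it also runs on Python 3. The temporary file and sample text are hypothetical. 'io.open' is shown as an alternative that exists on both Python 2.6+ and Python 3 (where it is the same function as the built-in open()). Recent Python 3 releases mark 'codecs.open' as deprecated, so 'io.open' may be the better choice for code that has to run on both.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample text; the u"" prefix keeps it a unicode string on Python 2.
sample = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Temporary file standing in for the real corpus file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# codecs.open accepts an encoding on both Python 2 and Python 3.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

with codecs.open(path, "r", encoding="utf-8") as f:
    via_codecs = f.read()

# io.open behaves like Python 3's built-in open() on both versions.
with io.open(path, encoding="utf-8") as f:
    via_io = f.read()

# Clean up the temporary file.
os.remove(path)
```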
