UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte

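I am trying to train a classifier on some text data using scikit-learn. The same code runs on other machines without any error, but on my system it fails with: UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte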

File "/root/Desktop/karim/svn/questo-anso/v5/trials/classify/domain_detection_final/test_classifier_temp.py", line 130, in trainClassifier 
    X_train = self.vectorizer.fit_transform(self.data_train.data) 
    File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 1270, in fit_transform 
    X = super(TfidfVectorizer, self).fit_transform(raw_documents) 
    File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 808, in fit_transform 
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary) 
    File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 741, in _count_vocab 
    for feature in analyze(doc): 
    File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 233, in <lambda> 
    tokenize(preprocess(self.decode(doc))), stop_words) 
    File "/root/Desktop/karim/software/scikit-learn-0.15.1/sklearn/feature_extraction/text.py", line 111, in decode 
    doc = doc.decode(self.encoding, self.decode_error) 
    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 1266: invalid start byte

I have already checked similar threads, but they did not help.

UPDATE:

self.data_train = self.fetch_data(cache, subset='train')
if not os.path.exists(self.root_dir + "/autocreated/vectorizer.txt"):
    self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                      stop_words='english')
    start_time = time()
    print("Transforming the dataset")
    X_train = self.vectorizer.fit_transform(self.data_train.data)  # Error is here
    joblib.dump(self.vectorizer, self.root_dir + "/autocreated/vectorizer.txt")

2014-09-01 user123

0xba is indeed an invalid start byte. What is the problem? – 2014-09-01 06:23:28

Encode the text, i.e. 'text.encode("utf-8")', and inspect it; you may get a clue – MaNKuR 2014-09-01 06:33:24

@nm: I don't know either. The encoding is fine, but I don't know why it reports an encoding error – user123 2014-09-01 06:36:20

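Your file is actually encoded in ISO-8859-1, not UTF-8. You need to decode it correctly before you can re-encode it.

0xBA is the masculine ordinal indicator (º) in ISO-8859-1, which is often used as a numero sign.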
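
As a minimal sketch of "decode correctly, then re-encode": the snippet below tries UTF-8 first and falls back to ISO-8859-1. The sample bytes and the helper name decode_utf8_or_latin1 are made up for illustration, and the fallback is only right if the data really is ISO-8859-1.

```python
# Hypothetical sample: the byte 0xba inside otherwise ASCII text,
# as it would appear in an ISO-8859-1 encoded file.
raw = b"Item n\xba 5"


def decode_utf8_or_latin1(data):
    """Decode bytes as UTF-8, falling back to ISO-8859-1 (an assumed fallback)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this never raises,
        # but the result is only meaningful if the data is really ISO-8859-1.
        return data.decode("iso-8859-1")


text = decode_utf8_or_latin1(raw)   # unicode text containing U+00BA
utf8_bytes = text.encode("utf-8")   # re-encoded as valid UTF-8
```

If every training file is known to be ISO-8859-1, passing encoding='latin-1' to TfidfVectorizer should have the same effect without rewriting the files, since the vectorizer decodes each document with its encoding setting.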

2014-09-01 06:53:28

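Thanks, bro. Do you mean decoding the data I use for training? I am using the 20 Newsgroups data http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html. I also tried copying text from Wikipedia, but even that gave the same error. – user123 2014-09-01 07:00:06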

I also checked that the character encoding is 'UTF-8' – user123 2014-09-01 07:05:51

...and how did you check that on the text? – 2014-09-01 07:07:42
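
One way to check, sketched here under the assumption that the document is available as raw bytes: attempt a strict UTF-8 decode and report the first offending byte. The helper name first_invalid_utf8 and the sample bytes are hypothetical.

```python
def first_invalid_utf8(data):
    """Return (position, byte) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the byte that could not be decoded,
        # the same number reported as "position" in the error message.
        return exc.start, data[exc.start:exc.start + 1]
    return None


bad = first_invalid_utf8(b"abc\xbadef")               # 0xba is not valid UTF-8 here
good = first_invalid_utf8(u"\u00ba".encode("utf-8"))  # properly encoded U+00BA
```

Running this over each training document (read in binary mode) would show which files are not valid UTF-8 and where.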

The problem was in how the training data was processed. One thing that solved it for me was ignoring the error with decode_error='ignore'; there may be other solutions.

self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english',decode_error='ignore')
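
To illustrate what this option does: the traceback shows the vectorizer calling doc.decode(self.encoding, self.decode_error), so decode_error is passed on as the error handler of the decode call. A small sketch with made-up sample bytes:

```python
# Hypothetical sample containing the byte 0xba, which is invalid as UTF-8 here.
raw = b"n\xba 5"

# 'ignore' silently drops the undecodable byte, so the character is lost.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD, which at least leaves a visible marker.
replaced = raw.decode("utf-8", "replace")
```

With 'ignore', every byte that is not valid UTF-8 disappears from the training text without any warning.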

2014-09-01 08:58:25 user123

This is a bad solution. You are now just hiding the fact that you failed to create correct input files. – 2014-09-01 19:20:25
