UnicodeDecodeError with stemming in Python

I have a list of a couple of thousand words:

x = ['company', 'arriving', 'wednesday', 'and', 'then', 'beach', 'how', 'are', 'you', 'any', 'warmer', 'there', 'enjoy', 'your', 'day', 'follow', 'back', 'please', 'everyone', 'go', 'watch', 's', 'new', 'video', 'you', 'know', 'the', 'deal', 'make', 'sure', 'to', 'subscribe', 'and', 'like', '<http>', 'you', 'said', 'next', 'week', 'you', 'will', 'be', 'the', 'one', 'picking', 'me', 'up', 'lol', 'hindi', 'na', 'tl', 'huehue', 'that', 'works', 'you', 'said', 'everyone', 'of', 'us', 'my', 'little', 'cousin', 'keeps', 'asking', 'if', 'i', 'wanna', 'play', 'and', "i'm", 'like', 'yes', 'but', 'with', 'my', 'pals', 'not', 'you', "you're", 'welcome', 'pas', 'quand', 'tu', 'es', 'vers', '<num>', 'i', 'never', 'get', 'good', 'mornng', 'texts', 'sad', 'sad', 'moment', 'i', 'think', 'ima', 'go', 'get', 'a', 'glass', 'of', 'milk', 'ahah', 'for', 'the', 'first', 'time', 'i', 'actually', 'know', 'what', 'their', 'doing', 'd', 'thank', 'you', 'happy', 'birthday', 'hope', "you're"...........]

I have confirmed that the type of every element in this list is a string:

types = []
for word in x:
    types.append(type(word))
print set(types)

>>>set([<type 'str'>])

Now I try to stem each word using NLTK's Porter stemmer:

import nltk 
porter = nltk.PorterStemmer() 
stemmed_x = [porter.stem(word) for word in x]

And I get this error, which clearly has something to do with the stemming package and Unicode somehow:

File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 633, in stem 
    stem = self.stem_word(word.lower(), 0, len(word) - 1) 
    File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 591, in stem_word 
    word = self._step1ab(word) 
    File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 289, in _step1ab 
    if word.endswith("ied"): 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)

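I have tried everything: using codecs.open, trying to explicitly encode each word as utf8. It still produces the same error.

Please advise.

EDIT:

I should mention that this code works perfectly on my PC running Ubuntu. I recently got a MacBook Pro, and that is where I get this error. I checked the terminal settings on my Mac, and it is set to utf8 encoding.

EDIT 2:

Interestingly, with this piece of code, I have isolated the problem words: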

for w in x:
    try:
        porter.stem(w)
    except UnicodeDecodeError:
        print w

#sagittarius」 
#instadane… 
#bleedblue」 
#pr챕cieux 
#على_شرفة_الماضي 
#exploringsf… 
#fishing… 
#sindhubestfriend… 
#الإستعداد_لإنهيار_ال_سعود 
#jaredpreslar… 
#femalepains」 
#gobillings」 
#juicing… 
#instamood…

It seems that what they all have in common is extra punctuation at the end of the word, except for #pr챕cieux.


2014-08-29 user1452494

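You probably have a multibyte UTF8 character. If it isn't too long, could you copy-paste your _full_ array definition from your code, exactly as it is? – 2014-08-29 10:01:08

Do you have any Latin characters? – 2014-08-29 10:22:50

You are mixing completely different character sets here. If you can, when you pull the data into your program, put words that belong to different languages (or better, different character sets) into different lists, as this will make your life easier. You can then decode the binary of those strings using the appropriate character set for each list. – 2014-08-29 11:02:11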
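One rough way to sketch the "separate lists" suggestion is to label each word by the script of its first letter, taken from the Unicode character name. This is only a heuristic, not something the commenter specified: the helper name `script_of` and the sample words are hypothetical, and the sketch assumes the words have already been decoded to text.

```python
import unicodedata
from collections import defaultdict


def script_of(word):
    """Rough script label: first token of the Unicode name of the first letter."""
    for ch in word:
        if ch.isalpha():
            return unicodedata.name(ch, u'UNKNOWN').split()[0]
    return 'NONE'  # no alphabetic character found


# Hypothetical sample data, already decoded to text.
words = [u'company', u'#pr\u00e9cieux', u'#\u0639\u0644\u0649']

# Collect the words into one list per script label.
by_script = defaultdict(list)
for w in words:
    by_script[script_of(w)].append(w)

print(sorted(by_script))
```

Each list in `by_script` could then be handled with rules that suit its script, for example passing only the Latin-script list to an English stemmer.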

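You probably have a multibyte UTF8 character, since 0xe2 is one of the possible first bytes of a 16-bit codepoint encoded as UTF-8. Because your program assumes ASCII characters, whose valid encoded values run from 0x00 to 0x7F, this value is rejected.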
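To make the byte-level reasoning concrete, here is a minimal sketch. It assumes the offending word ends in a horizontal ellipsis (U+2026), as `#fishing…` from the question appears to, and that the word is stored as UTF-8 bytes. The variable names are illustrative only, and the snippet is written so that it should run under both Python 2 and Python 3.

```python
# Assumption: the word is held as UTF-8 bytes (a Python 2 `str`).
raw = b'#fishing\xe2\x80\xa6'  # '#fishing' followed by U+2026

# First byte that falls outside the ASCII range 0x00-0x7F.
first_non_ascii = [b for b in bytearray(raw) if b > 0x7F][0]

# Decoding as ASCII fails on that byte ...
ascii_error = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc

# ... while decoding as UTF-8 turns the three trailing bytes into one character.
text = raw.decode('utf-8')

print(hex(first_non_ascii))
print(ascii_error)
print(len(raw), len(text))
```

The byte string is three bytes longer than its visible prefix, while the decoded text is only one character longer, which is what "multibyte character" means here.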

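(Since I assume from your data that you only want to handle ASCII characters) you may be able to pinpoint the "bad" values with a simple comprehension and then fix them by hand: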

print [value for value in x if '\xe2' in x]
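Note that the comprehension as posted tests `'\xe2' in x`, that is, whether the list contains that one-byte string as an element, rather than whether each word contains the byte. Unless some element is exactly that byte, the result is an empty list. A possible variant checks each `value` instead. The sample list below is made up for illustration, and the sketch assumes the words are byte strings (Python 2 `str`, or `bytes` on Python 3).

```python
# Hypothetical sample data standing in for the question's list `x`.
x = [b'company', b'#fishing\xe2\x80\xa6', b'beach', b'#pr\xc3\xa9cieux']

# Words containing the specific byte 0xe2 (test each word, not the whole list).
has_e2 = [value for value in x if b'\xe2' in value]

# Broader check: words containing any byte outside the ASCII range.
non_ascii = [value for value in x
             if any(b > 0x7F for b in bytearray(value))]

print(has_e2)
print(non_ascii)
```

The broader check also catches words whose non-ASCII bytes start with something other than 0xe2, such as an accented Latin letter.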


2014-08-29 10:06:50

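Tried it, it returned an empty list – user1452494 2014-08-29 10:27:36

@user1452494 You definitely need to give us your raw data. Could you upload your file somewhere? – 2014-08-29 10:29:50

See the edit, I listed the words in the list that cause the error – user1452494 2014-08-29 10:32:08

Using word.decode('utf-8') should resolve this error.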

import nltk 
porter = nltk.PorterStemmer() 
stemmed_x = [porter.stem(word.decode('utf-8')) for word in x]
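A slightly more defensive variant of the same idea, as a sketch: decode only the items that are still byte strings, so text that has already been decoded passes through unchanged. The helper name `to_text` and the sample words are hypothetical, and the stemmer call is left as a comment so the snippet does not depend on NLTK being installed.

```python
def to_text(word, encoding='utf-8'):
    """Return `word` as a text (unicode) string, decoding byte strings."""
    if isinstance(word, bytes):  # `bytes` is an alias of `str` on Python 2
        return word.decode(encoding)
    return word


# Hypothetical sample data; in the question this would be the list `x`.
words = [b'arriving', b'#juicing\xe2\x80\xa6', u'warmer']
decoded = [to_text(w) for w in words]

# With NLTK available, the stemming step would then be:
# stemmed_x = [porter.stem(w) for w in decoded]
print(decoded)
```

Decoding once, where the data enters the program, keeps every later string operation working on text rather than on raw bytes.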


2017-11-23 05:03:08
