How to fix UnicodeDecodeError: 'ascii' codec can't decode byte?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This is the error I get when trying to clean a list of names that I extracted from an HTML page using spaCy. How can I fix it?

My code:

from __future__ import unicode_literals  # must come before all other imports
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English
nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 

    # delete unwanted tags: 
    for s in soup(['figure', 'script', 'style']): 
        s.decompose()

    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})] 

    text = ''.join(article_soup) 
    return text 

# using spacy 
def get_names(all_tags): 
    names=[] 
    for ent in all_tags.ents: 
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names 

def cleaning_names(names): 
    new_names = [s.strip("'s") for s in names] # remove 's' from names 
    myset = list(set(new_names)) #remove duplicates 
    return myset 

def main(): 
    url = "http://www.bbc.co.uk/news/uk-politics-39784164" 
    text=get_text(url) 
    text=u"{}".format(text) 
    all_tags = nlp(text) 
    names = get_names(all_tags)
    print "names:" 
    print names 
    mynewlist = cleaning_names(names) 
    print mynewlist 

if __name__ == '__main__': 
    main()

For this particular URL I get a list of names that includes characters such as £ or $:

['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']

And then the error:

Traceback (most recent call last) <ipython-input-19-8582e806c94a> in <module>() 
    47 
    48 if __name__ == '__main__': 
---> 49  main() 

<ipython-input-19-8582e806c94a> in main() 
    43  print "names:" 
    44  print names 
---> 45  mynewlist = cleaning_names(names) 
    46  print mynewlist 
    47 

<ipython-input-19-8582e806c94a> in cleaning_names(names) 
    31 
    32 def cleaning_names(names): 
---> 33  new_names = [s.strip("'s") for s in names] # remove 's' from names 
    34  myset = list(set(new_names)) #remove duplicates 
    35  return myset 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

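I have tried different ways of fixing the Unicode problem (including sys.setdefaultencoding('utf8')), and nothing worked. I hope someone has had the same problem before and can suggest a fix. Thank you!

Source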

2017-05-07 aviss

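Clean up your traceback. It is unreadable. – Kanak

I can't tell where the error occurs, and I can't reproduce it because of the libraries. Does it work if you fix the list of names by hand? – handle

Have you checked the related questions shown on the right? – handle

I finally fixed my code. I am surprised at how easy the fix looks, but it took me a long time to get there, and I see many people confused by the same problem, so I decided to post my answer.

Adding this small function before cleaning the names any further solved my problem.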

def decode(names):   
    decodednames = [] 
    for name in names: 
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

spaCy still thinks £59bn is a person, but that is OK with me; I can deal with it later in my code.

Working code:

from __future__ import unicode_literals  # must come before all other imports
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English
nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 

    # delete unwanted tags: 
    for s in soup(['figure', 'script', 'style']): 
        s.decompose()

    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})] 

    text = ''.join(article_soup) 
    return text 

# using spacy 
def get_names(all_tags): 
    names=[] 
    for ent in all_tags.ents: 
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names 

def decode(names):   
    decodednames = [] 
    for name in names: 
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames 

def cleaning_names(names): 
    new_names = [s.strip("'s") for s in names] # remove 's' from names 
    myset = list(set(new_names)) #remove duplicates 
    return myset 

def main(): 
    url = "http://www.bbc.co.uk/news/uk-politics-39784164" 
    text=get_text(url) 
    text=u"{}".format(text) 
    all_tags = nlp(text) 
    names = get_names(all_tags)
    print "names:" 
    print names 
    decodednames = decode(names) 
    mynewlist = cleaning_names(decodednames) 
    print mynewlist 

if __name__ == '__main__': 
    main()

This gives me the following without any errors:

names:
['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']
[u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']
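A side note on cleaning_names: str.strip("'s") treats its argument as a set of characters to trim from both ends, not as a suffix, so it can also eat a name's own trailing "s". The sketch below illustrates the difference; the helper strip_possessive and the sample name "James" are hypothetical and not part of the original code.

```python
def strip_possessive(name):
    """Remove one trailing possessive marker ('s), leaving other letters alone."""
    # Handle both the ASCII apostrophe and the typographic one.
    for suffix in (u"'s", u"\u2019s"):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name


samples = [u"Theresa May's", u"James", u"Mr Clegg"]

# strip() trims any run of ' and s characters from either end.
print([s.strip(u"'s") for s in samples])

# The suffix check only removes a real possessive ending.
print([strip_possessive(s) for s in samples])
```

With strip(), "James" loses its final letter, while the suffix check leaves it alone.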

Source

2017-05-09 08:37:27 aviss

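Of course you can simply ignore every character that is not ASCII; that is easy. It may come back to bite you later. The proper way to do the conversion is to let the libraries do it for you, because they know the appropriate encoding and you don't. –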
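To see what "ignoring" costs, here is a minimal sketch in Python 3 syntax. It uses two entries from the list in the question and assumes, based on the \xc2\xa3 byte pair, that the bytes are UTF-8.

```python
# Two entries from the question's output, as raw bytes.
raw_names = [b"Nick Clegg", b"\xc2\xa359bn"]

# errors="ignore" silently drops every byte the ASCII codec cannot handle.
ignored = [raw.decode("ascii", errors="ignore") for raw in raw_names]

# Decoding with the actual encoding (assumed UTF-8 here) keeps the pound sign.
decoded = [raw.decode("utf-8") for raw in raw_names]

print(ignored)
print(decoded)
```

The first list matches the u'59bn' entry in the answer's output: the pound sign is gone and nothing reports the loss.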

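When you get a decoding error from the 'ascii' codec, it usually means that a byte string is being used in a context that requires a Unicode string (in Python 2, that is; Python 3 won't allow it at all).

Since you imported from __future__ import unicode_literals, the string "'s" is Unicode. That means the string you are trying to strip must be a Unicode string as well. Fix that and you won't get this error any more.

Source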
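The implicit conversion can be made visible with a minimal sketch in Python 3, where the decode has to be spelled out. The byte string is the '\xc2\xa359bn' entry from the question; the explicit decode("ascii") stands in for what Python 2 attempts silently when a byte string meets a Unicode string, and UTF-8 is assumed to be the real encoding.

```python
# The problematic entry from the question's list, as raw UTF-8 bytes.
raw = b"\xc2\xa359bn"

# Python 3 refuses to mix bytes and str instead of guessing an encoding.
try:
    raw.strip("'s")
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

# Python 2 attempts this ASCII decode behind the scenes, and it fails.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the real encoding first makes strip() safe.
cleaned = raw.decode("utf-8").strip("'s")

print(type(mixed_error).__name__)
print(ascii_error)
print(cleaned)
```

The second print shows the same message as the traceback in the question, which is why converting the names to Unicode before calling strip removes the error.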

2017-05-07 22:36:35

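That is exactly what I am trying to solve. – aviss

@aviss You had an answer, since deleted, that told you how to fix it. I don't know enough about 'request' or 'BeautifulSoup' to know the specifics. –

As @MarkRansom commented, ignoring non-ASCII characters will come back to bite you.

First, take a look at

Also, note that this is an anti-pattern: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

The simplest fix is to use Python 3, which will spare you some of the pain: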

>>> import requests 
>>> from bs4 import BeautifulSoup 
>>> import spacy 
>>> nlp = spacy.load('en') 

>>> url = "http://www.bbc.co.uk/news/uk-politics-39784164" 
>>> html = requests.get(url).content 
>>> bsoup = BeautifulSoup(html, 'html.parser') 
>>> text = '\n'.join(p.text for d in bsoup.find_all('div', {'class': 'story-body__inner'}) for p in d.find_all('p') if p.text.strip()) 

>>> doc = nlp(text)
>>> names = [ent for ent in doc.ents if ent.label_ == 'PERSON']

Source

2017-05-17 13:59:18 alvas
