Python throws UnicodeEncodeError although I am calling str.decode(). Why?

Consider this function:

import htmlentitydefs

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

It is supposed to escape all non-ASCII characters with the corresponding htmlentitydefs. Unfortunately, Python throws

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'. But I don't use str.encode(). I only use str.decode(). What am I missing?


2011-12-21 Aufwind

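Python has two types of strings: character strings (the unicode type) and byte strings (the str type). The code you pasted operates on byte strings. You need a similar function that handles character strings.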

Maybe this:

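import htmlentitydefs

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        # Escape everything outside printable ASCII.
        if (ord(c) < 32) or (ord(c) > 126):
            name = htmlentitydefs.codepoint2name.get(ord(c))
            # Not every code point has a named entity; fall back to a numeric reference.
            c = u'&{};'.format(name) if name else u'&#{};'.format(ord(c))
        escaped_chars.append(c)
    return u''.join(escaped_chars)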

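I'm not sure either function is what you really need. If it were me, I would choose UTF-8 as the character encoding of the resulting document, handle the document as character strings (without worrying about entities), and call content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework you use, you may even be able to pass the character string straight to the API and let it figure out how to set the encoding.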
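The "encode as the last step" idea might look roughly like this. It is a minimal sketch written for Python 3, where str is the character type and bytes is the byte type. The render_page function and the headers dictionary are hypothetical stand-ins, not part of any particular web framework.

```python
def render_page(name):
    """Hypothetical page builder that works purely on character strings."""
    return '<p>Hello, {}!</p>'.format(name)


# Keep everything as character strings while building the document ...
content = render_page('Tam\xe1s Horv\xe1th')

# ... and encode exactly once, right before the bytes go to the client.
body = content.encode('UTF-8')

# The declared charset has to match the encoding used above.
headers = {'Content-Type': 'text/html; charset=UTF-8'}

print(repr(body))
```

With this layout no entity escaping is needed for non-ASCII letters, because the browser is told the document is UTF-8.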


2011-12-21 14:39:28 wberry

You are passing in a string that is already unicode. So before Python can call decode on it, it first has to encode it, and by default it does that with the ASCII encoding.
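The hidden encode step can be spelled out to show where the UnicodeEncodeError comes from. The sketch below is written for Python 3 (whose str has no decode method) and only emulates the Python 2 behaviour described above. The helper names py2_style_decode and error_name, and the hard-coded 'ascii' default, are assumptions made for illustration.

```python
def py2_style_decode(text, codec):
    """Emulate Python 2's unicode.decode(): encode implicitly, then decode."""
    # Step 1 (implicit in Python 2): character string -> bytes, using the
    # default encoding, assumed here to be ASCII.
    raw = text.encode('ascii')
    # Step 2 (what the caller actually asked for): bytes -> character string.
    return raw.decode(codec)


def error_name(text, codec):
    """Return the name of the exception raised, or None if decoding works."""
    try:
        py2_style_decode(text, codec)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


print(error_name('Tamas', 'ascii'))      # pure ASCII passes step 1
print(error_name('Tam\xe1s', 'ascii'))   # fails in step 1, the encode
```

The second call fails before any decoding happens, which is why the exception is named after encoding even though the code only asks for a decode.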

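Edited to add: it depends on what you want to do. If you simply want to convert a unicode string containing non-ASCII characters into an HTML-encoded representation, you can do it in a single call: text.encode('ascii', 'xmlcharrefreplace').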
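As a quick illustration of that one-call approach, here is a small sketch using the string from the question. It is written for Python 3, where the result is a bytes object; in Python 2 the same call on a unicode object returns a str.

```python
text = 'Tam\xe1s Horv\xe1th'

# 'xmlcharrefreplace' swaps every character that ASCII cannot represent
# for a numeric character reference instead of raising an error.
escaped = text.encode('ascii', 'xmlcharrefreplace')

print(escaped)  # b'Tam&#225;s Horv&#225;th'
```

Note that this produces numeric references such as &#225; rather than named entities such as &aacute;. Browsers render both the same way.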


2011-12-21 14:07:04

Do you mean I should catch UnicodeEncodeError as well, as a solution? – Aufwind 2011-12-21 14:10:14

Or is my whole approach to escaping the characters nonsense? – Aufwind 2011-12-21 14:13:23

-2

It makes no sense to decode a str.

I think you can check ord(c) > 127 instead.
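A sketch of that ord(c) > 127 check, written for Python 3, where the entity table lives in html.entities rather than htmlentitydefs. The function name escape_non_ascii and the numeric fallback for code points without a named entity are assumptions added for illustration.

```python
from html.entities import codepoint2name


def escape_non_ascii(text):
    """Replace every non-ASCII character with an HTML entity."""
    parts = []
    for c in text:
        code = ord(c)
        if code > 127:
            # Prefer a named entity; fall back to a numeric reference
            # because codepoint2name does not cover every code point.
            name = codepoint2name.get(code)
            c = '&{};'.format(name) if name else '&#{};'.format(code)
        parts.append(c)
    return ''.join(parts)


print(escape_non_ascii('Tam\xe1s Horv\xe1th'))  # Tam&aacute;s Horv&aacute;th
```

No decode call is involved, so the implicit ASCII encoding never gets a chance to fail.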


2011-12-21 14:17:49 kev

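This is a misleading error report, and it comes from the way Python handles the de/encoding process. You tried to decode an already decoded string a second time, which confuses the Python function, which in turn confuses you! ;-) As far as I know, the encoding/decoding is carried out by the codecs module, and that is where this misleading exception message originates.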

You can verify this yourself: either

u'\x80'.encode('ascii')

or

u'\x80'.decode('ascii')

will throw a UnicodeEncodeError, whereas

u'\x80'.encode('utf8')

will not, but

u'\x80'.decode('utf8')

will again!

I guess you are confused about what encoding and decoding mean. To put it simply:

                    decode             encode
ByteString (ascii) --------> UNICODE ---------> ByteString (utf8)
                     codec              codec

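But why does the decode method take a codec argument? Well, the underlying function cannot guess which codec the ByteString was encoded with, so the codec is passed in as a hint. If none is provided, sys.getdefaultencoding() is used implicitly.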

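So when you write c.decode('ascii') you are saying that a) you have an (encoded) ByteString (that is why you use decode), b) you want a unicode object out of it (that is what decode is for), and c) the codec the ByteString was encoded with is ascii.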
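The direction of the two operations, and the role of the codec argument, can be checked with a short sketch. It is written for Python 3, where the byte type is bytes and the character type is str, so the mix-up from the question cannot happen: only bytes has decode and only str has encode. The sample values are made up for illustration.

```python
# decode: bytes -> characters. The codec says how the bytes were encoded.
raw_ascii = b'Horvath'
text = raw_ascii.decode('ascii')

# encode: characters -> bytes. The codec says how the bytes should be written.
raw_utf8 = 'Horv\xe1th'.encode('utf8')
print(repr(raw_utf8))  # b'Horv\xc3\xa1th'

# Giving decode the wrong codec fails, because the bytes 0xC3 0xA1
# are not valid ASCII.
try:
    raw_utf8.decode('ascii')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# With the right codec the round trip restores the original characters.
round_trip = raw_utf8.decode('utf8')
print(wrong_codec_failed, round_trip == 'Horv\xe1th')
```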

See also: https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror


2011-12-21 15:51:30

Thank you very much for the detailed explanation. – Aufwind 2011-12-22 11:10:06

This answer always works for me when I run into this kind of problem:

def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key): byteify(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

From "How to get string objects instead of Unicode ones from JSON in Python?"


2015-11-26 21:58:52 Blairg23

I found this on this site:

import sys

# reload() restores setdefaultencoding, which site.py removes at startup.
reload(sys)
sys.setdefaultencoding("latin-1")

a = u'\xe1'
print str(a) # no exception


2016-07-22 16:13:48
