How do I unescape HTML entities in a string in Python 3.1?

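I have looked all around and only found solutions for Python 2.6 and earlier, nothing on how to do this in Python 3.X. (I only have access to a Win7 box.)

I have to be able to do this in 3.1, preferably without external libraries. Currently I have httplib2 installed and access to curl from the command prompt (that is how I am getting the source code of the pages). Unfortunately, curl does not decode HTML entities, and as far as I can tell its documentation does not list a command for decoding them.

Yes, I have tried to get Beautiful Soup to work, many times, without success on 3.X. If you can provide EXPLICIT instructions on how to get it running on Python 3 in an MS Windows environment, I would be very grateful.

So, to be clear, I need to turn a string like this: "Suzy &amp; John" into a string like this: "Suzy & John".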

2010-03-02 Sho Minamimoto

121

You can use the function html.unescape:

In Python 3.4+ (thanks to J.F. Sebastian for the update):

import html 
html.unescape('Suzy &amp; John') 
# 'Suzy & John' 

html.unescape('&quot;') 
# '"'

In Python 3.3 or older:

import html.parser  
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

import HTMLParser 
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
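
If the same script has to run on several interpreter versions, the three variants above can be folded into one import fallback. This is only a minimal sketch using the standard library; the module-level name `unescape` is a convenience chosen for this example. Note that HTMLParser.unescape was removed in Python 3.9, so on current versions only the first branch applies.

```python
try:
    # Python 3.4 and newer
    from html import unescape
except ImportError:
    try:
        # Python 3.0 to 3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))  # Suzy & John
```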

2010-03-02 03:00:32 unutbu

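Awesome! However, I am finding that only some characters get unescaped. For example, the ampersand still stays escaped. Can you explain why that is? How do I go about unescaping these characters? – 2010-03-02 03:11:44

@Sho Minamimoto: I have added an example. Hope that helps? – unutbu 2010-03-02 03:16:51

Yes, I see now, thank you! – 2010-03-03 22:05:41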

Python 3.x has html.entities too.
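
For reference, html.entities provides lookup tables rather than a ready-made unescape function. Here is a small sketch of looking up a single entity by hand with name2codepoint; the helper name `entity_to_char` is made up for this example.

```python
from html.entities import name2codepoint

def entity_to_char(name):
    """Return the character for an entity name such as 'amp', or None if unknown."""
    codepoint = name2codepoint.get(name)
    return chr(codepoint) if codepoint is not None else None

print(entity_to_char('amp'))   # &
print(entity_to_char('quot'))  # "
```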

2010-03-02 03:01:41 YOU

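I am not sure whether this is a built-in library or not, but it looks like what you need, and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}): Unescape '&amp;', '&lt;', and '&gt;' in a string of data.

Jacob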

2010-03-02 03:02:19 TheJacobTaylor

You can use xml.sax.saxutils.unescape for this purpose. The module is included in the Python standard library and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils 
>>> saxutils.unescape("Suzy &amp; John") 
'Suzy & John'
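
By default saxutils.unescape only handles &amp;, &lt; and &gt;. Its optional entities argument takes a dictionary of extra replacements, so a few more entities can be covered as sketched below. The EXTRA_ENTITIES table is an assumption for this example, not a complete HTML entity list.

```python
from xml.sax import saxutils

# Extra replacements beyond the built-in &amp;, &lt; and &gt;.
# This table is illustrative only; extend it as needed.
EXTRA_ENTITIES = {'&quot;': '"', '&apos;': "'"}

text = '&quot;Suzy &amp; John&quot;'
print(saxutils.unescape(text))                  # &quot;Suzy & John&quot;
print(saxutils.unescape(text, EXTRA_ENTITIES))  # "Suzy & John"
```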

2010-03-02 03:03:50

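This seems incomplete: it does not decode "&euml;", although HTMLParser does – bcoughlan 2013-01-02 12:33:33

Apparently I do not have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotes. The only thing I found that does is this function: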

 
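import re
# Python 2 only: in Python 3 this module is html.entities and unichr() is chr().
from htmlentitydefs import name2codepoint as n2cp

def decodeHtmlentities(string):
    """Replace named and decimal numeric HTML entities in string."""
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            if ent.isdigit():
                # Decimal character reference such as &#39;
                return unichr(int(ent))
            # Anything else after '#' (e.g. the hex form &#x41;) is left unchanged
            return match.group()
        cp = n2cp.get(ent)
        if cp:
            # Named entity such as &quot;
            return unichr(cp)
        # Unknown entity name: leave the text unchanged
        return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]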

I got this from this page.
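
Since the question asks about Python 3, here is a rough sketch of how the same idea might look there, using html.entities and chr(). The names `ENTITY_RE` and `decode_html_entities` are chosen for this example, and the hexadecimal form (&#x41;) is an addition that the function above does not handle.

```python
import re
from html.entities import name2codepoint

# Matches decimal (&#39;), hexadecimal (&#x27;) and named (&quot;) entities.
ENTITY_RE = re.compile(r"&(?:#(\d{1,7})|#[xX]([0-9a-fA-F]{1,6})|(\w{1,8}));")

def decode_html_entities(text):
    def substitute(match):
        decimal, hexadecimal, name = match.groups()
        if decimal is not None:
            codepoint = int(decimal)
        elif hexadecimal is not None:
            codepoint = int(hexadecimal, 16)
        else:
            codepoint = name2codepoint.get(name)
        # Unknown names and out-of-range numbers are left untouched.
        if codepoint is None or codepoint > 0x10FFFF:
            return match.group(0)
        return chr(codepoint)
    return ENTITY_RE.sub(substitute, text)

print(decode_html_entities('&quot;Suzy &amp; John&quot;'))  # "Suzy & John"
```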

2010-09-26 07:09:13

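In my case, I had an HTML string that had been escaped with the AS3 escape function. After an hour of Googling I had not found anything useful, so I wrote this recursive function to meet my needs. Here it is: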

# Python 2 only: relies on str.decode('hex') and str.decode('unicode_escape').
def unescape(string):
    index = string.find("%")
    if index == -1:
        return string
    # An escaped Unicode character (%uXXXX) needs different decoding
    if string[index+1:index+2] == 'u':
        replace_with = ("\\" + string[index+1:index+6]).decode('unicode_escape')
        string = string.replace(string[index:index+6], replace_with)
    else:
        replace_with = string[index+1:index+3].decode('hex')
        string = string.replace(string[index:index+3], replace_with)
    return unescape(string)

Edit 1: added functionality to handle Unicode characters.
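
The str.decode calls above do not exist on Python 3 strings. One possible Python 3 take on the same %XX and %uXXXX decoding is sketched below. The name `unescape_percent` is made up for this example, and it assumes each %XX is a single code point below 256. It works in a single pass, so a decoded "%" is never looked at again, and it does not combine %uXXXX surrogate pairs.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
ESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")

def unescape_percent(text):
    def substitute(match):
        # Exactly one of the two groups is set for any match.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return ESCAPE_RE.sub(substitute, text)

print(unescape_percent('Suzy%20%26%20John'))  # Suzy & John
```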

2012-10-25 12:52:39 Simanas
