Escaping special HTML characters in Python

I have a string in which special characters such as ' or " or & (...) can appear. For example, in the string:

string = """ Hello "XYZ" this 'is' a test & so on """

How can I automatically escape every special character, so that I get this:

string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "


2010-01-16 creativz

In Python 3.2 and later, you can use the html.escape function, for example:

>>> string = """ Hello "XYZ" this 'is' a test & so on """ 
>>> import html 
>>> html.escape(string) 
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
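As a small sketch of how the optional `quote` argument changes the result, and of how `html.unescape()` (available since Python 3.4) reverses it. The variable names are only illustrative:

```python
import html

raw = """ Hello "XYZ" this 'is' a test & so on """

# quote=True is the default: both " and ' are escaped.
escaped = html.escape(raw)

# quote=False leaves quotes alone and only escapes &, < and >.
escaped_no_quotes = html.escape(raw, quote=False)

# html.unescape() turns the entities back into characters.
round_trip = html.unescape(escaped)

print(escaped)
print(escaped_no_quotes)
print(round_trip == raw)
```

Leaving quotes unescaped is only safe for text that ends up between tags, not inside an attribute value.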

For earlier versions of Python, see http://wiki.python.org/moin/EscapingHtml:

The cgi module that ships with Python has an escape() function:
import cgi 

s = cgi.escape("""& < >""") # s = "&amp; &lt; &gt;" 
However, this does not escape any characters other than &, < and >. If it is called as cgi.escape(string_to_escape, quote=True), it also escapes ".
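Note that cgi.escape() was deprecated in Python 3.2 and removed in Python 3.8. For code that has to run on both old and new interpreters, one possible approach is a small shim like the sketch below. The name `escape_html` is hypothetical, not part of any library:

```python
try:
    from html import escape as _escape  # Python 3.2 and later
except ImportError:
    from cgi import escape as _escape   # Python 2 fallback


def escape_html(text, quote=True):
    """Escape &, < and >, and also quotes when `quote` is true.

    Hypothetical helper. The two back ends differ slightly:
    html.escape() escapes both " and ', cgi.escape() only ".
    """
    return _escape(text, quote)


print(escape_html('& < > "'))
print(escape_html('& < > "', quote=False))
```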

Here is a small snippet that lets you escape quotes and apostrophes as well:
html_escape_table = { 
    "&": "&amp;", 
    '"': "&quot;", 
    "'": "&apos;", 
    ">": "&gt;", 
    "<": "&lt;", 
    } 

def html_escape(text): 
    """Produce entities within text.""" 
    return "".join(html_escape_table.get(c,c) for c in text) 
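You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.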
from xml.sax.saxutils import escape, unescape 
# escape() and unescape() take care of &, < and >. 
html_escape_table = { 
    '"': "&quot;", 
    "'": "&apos;" 
} 
html_unescape_table = {v:k for k, v in html_escape_table.items()} 

def html_escape(text): 
    return escape(text, html_escape_table) 

def html_unescape(text): 
    return unescape(text, html_unescape_table) 
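A quick round-trip check of this approach, sketched with `&#39;` in place of `&apos;` for the apostrophe (an assumption made here so that the output matches the form asked for in the question). The table names are illustrative:

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements applied after the built-in &, < and > handling.
extra_escapes = {'"': "&quot;", "'": "&#39;"}
extra_unescapes = {v: k for k, v in extra_escapes.items()}

raw = """ Hello "XYZ" this 'is' a test & so on """

encoded = escape(raw, extra_escapes)
decoded = unescape(encoded, extra_unescapes)

print(encoded)
print(decoded == raw)
```

escape() replaces & before it applies the extra table, so the ampersands in the inserted entities are not escaped a second time.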


2010-01-16 12:30:29 kennytm

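Thank you for `quote=True` in `cgi.escape`. – sidx 2015-12-29 11:16:12

Note that some of your replacements are not HTML-standards compliant. For example: https://www.w3.org/TR/xhtml1/#C_16 Instead of &apos;, use &#39;. I guess some others were added to the HTML4 standard, but that one was not. – leetNightshade 2017-11-30 00:32:54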

The cgi.escape method converts special characters into valid HTML entities:

# cgi.escape() exists in Python 2 and in Python 3 up to 3.7 only.
import cgi 
original_string = 'Hello "XYZ" this \'is\' a test & so on ' 
escaped_string = cgi.escape(original_string, True) 
print(original_string) 
print(escaped_string)

will result in

Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on

The optional second parameter of cgi.escape makes it escape double quotes. By default, they are not escaped.


2010-01-16 12:34:34

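I don't understand why cgi.escape is so picky about converting double quotes, and ignores single quotes entirely. – 2010-01-16 13:11:24

Because quotes do not need to be escaped in PCDATA, but they *do* need to be escaped in attributes (which most often use the double-quote delimiter), and the former is far more common than the latter. In general, this is a textbook 90% solution (more like >99%). If you must save every last byte and want it to determine dynamically which kind of quoting achieves that, use `xml.sax.saxutils.quoteattr()`. – 2010-01-16 13:16:29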
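To illustrate the `quoteattr()` suggestion from the comment above, here is a minimal sketch. It returns the value already wrapped in quotes, picking the delimiter that needs the least escaping. The sample strings and variable names are made up for illustration:

```python
from xml.sax.saxutils import quoteattr

# No quotes in the value: wrapped in double quotes, & is still escaped.
plain = quoteattr('plain & simple')

# Only double quotes in the value: wrapped in single quotes, nothing escaped.
has_double = quoteattr('say "hi"')

# Both kinds present: wrapped in double quotes, inner " becomes &quot;.
has_both = quoteattr('both \' and "')

# The result can be dropped straight after the "=" of an attribute.
tag = '<a title=%s>link</a>' % has_double

print(plain)
print(has_double)
print(has_both)
print(tag)
```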

Simple string functions will do it:

def escape(t): 
    """HTML-escape the text in `t`.""" 
    return (t 
     .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;") 
     .replace("'", "&#39;").replace('"', "&quot;") 
     )

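The other answers in this thread have minor problems: for some reason, the cgi.escape method ignores single quotes, and you need to ask it explicitly to do double quotes. The linked wiki page does all five, but uses the XML entity &apos;, which is not an HTML entity.

This function does all five all the time, using HTML-standard entities.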
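One detail worth spelling out is that the order of the replace() calls matters: & has to be handled first, otherwise the ampersands introduced by the other replacements get escaped again. A small sketch, where `escape_wrong_order` is a hypothetical counter-example and not something anyone in this thread proposed:

```python
def escape(t):
    """HTML-escape the text in `t`, handling & before everything else."""
    return (t.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
             .replace("'", "&#39;").replace('"', "&quot;"))


def escape_wrong_order(t):
    """Hypothetical counter-example: & is replaced after <."""
    return t.replace("<", "&lt;").replace("&", "&amp;")


sample = """ Hello "XYZ" this 'is' a test & so on """

print(escape(sample))
print(escape("<b>"))
print(escape_wrong_order("<b>"))  # the & of &lt; is escaped a second time
```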


2010-01-16 13:10:04

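The other answers here will help with characters such as the ones you listed and a few others. However, if you also want to convert everything else to entity names, you will have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you. You will want to do something like this, which uses html.entities.entitydefs, which is just a dictionary. (The following code is written for Python 3.x, but there is a partial attempt at making it compatible with 2.x to give you an idea):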

# -*- coding: utf-8 -*- 

import sys 

if sys.version_info[0]>2: 
    from html.entities import entitydefs 
else: 
    from htmlentitydefs import entitydefs 

text=";\"áèïøæỳ" #This is your string variable containing the stuff you want to convert 
text=text.replace(";", "$ஸ$") #$ஸ$ is just something random the user isn't likely to have in the document. We're converting it so it doesn't convert the semi-colons in the entity name into entity names. 
text=text.replace("$ஸ$", "&semi;") #Converting semi-colons to entity names 

if sys.version_info[0]>2: #Using appropriate code for each Python version. 
    for k,v in entitydefs.items(): 
        if k not in {"semi", "amp"}: 
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually. 
else: 
    for k,v in entitydefs.iteritems(): 
        if k not in {"semi", "amp"}: 
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually. 

#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter: 
text=text.replace("ŷ", "&ycirc;") 
text=text.replace("Ŷ", "&Ycirc;") 
text=text.replace("ŵ", "&wcirc;") 
text=text.replace("Ŵ", "&Wcirc;") 
text=text.replace("ỳ", "&#7923;") 
text=text.replace("Ỳ", "&#7922;") 
text=text.replace("ẃ", "&wacute;") 
text=text.replace("Ẃ", "&Wacute;") 
text=text.replace("ẁ", "&#7809;") 
text=text.replace("Ẁ", "&#7808;") 

print(text) 
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923; 
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
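As an alternative sketch for Python 3 (not the method used above), the text can be walked once, character by character, looking each code point up in `html.entities.codepoint2name`. Because every character is visited exactly once, the placeholder trick for semicolons is not needed. The function name `named_entities` is hypothetical, and falling back to a numeric reference for unnamed non-ASCII characters is an assumption about the desired behavior:

```python
from html.entities import codepoint2name


def named_entities(text):
    """Replace characters with named entities in a single pass.

    Characters without an HTML 4 entity name are kept as-is when they
    are ASCII, and written as numeric references otherwise.
    """
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append("&" + name + ";")
        elif ord(ch) > 127:
            out.append("&#%d;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


print(named_entities(';"áèïøæỳ'))
```

Unlike the code above, this sketch leaves a plain semicolon untouched and also converts & to &amp;.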


2014-06-23 19:20:41 Shule
