2013-04-20 88 views
2

Preventing XSS inside an HTML element

Is the following sufficient to prevent XSS inside an HTML element?

function XSS_encode_html ($str)
{
    // Replace & first so the entities produced below are not escaped twice.
    $str = str_replace ('&', "&amp;", $str);
    $str = str_replace ('<', "&lt;", $str);
    $str = str_replace ('>', "&gt;", $str);
    $str = str_replace ('"', "&quot;", $str);
    $str = str_replace ('\'', "&#x27;", $str);
    $str = str_replace ('/', "&#x2F;", $str);

    return $str;
}

As mentioned here:
https://www.owasp.org/index.php/Abridged_XSS_Prevention_Cheat_Sheet#RULE_.231_-_HTML_Escape_Before_Inserting_Untrusted_Data_into_HTML_Element_Content


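Edit: I did not use htmlspecialchars() because:

  1. It does not convert / to &#x2F;.
  2. ' (single quote) becomes '&#039;' (or &apos;) when ENT_QUOTES is set.

According to OWASP, ' (single quote) should become &#x27; (call me pedantic), and &apos; is not recommended because it is not in the HTML spec.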
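One possible way to reconcile the two points above, offered as a rough sketch and not as a vetted implementation: let htmlspecialchars() do the core escaping, then rewrite the two details that differ from the OWASP rule. The function name owasp_html_escape is hypothetical; it is not part of PHP or of any OWASP library.

```php
<?php
// Hypothetical helper (the name is an assumption, not a PHP or OWASP API).
function owasp_html_escape(string $str): string
{
    // ENT_QUOTES | ENT_HTML401 escapes & < > " and turns ' into &#039;.
    $escaped = htmlspecialchars($str, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Rewrite the single-quote reference to hex form and escape the slash.
    return str_replace(['&#039;', '/'], ['&#x27;', '&#x2F;'], $escaped);
}

echo owasp_html_escape("it's </b>"), "\n"; // it&#x27;s &lt;&#x2F;b&gt;
```

Because the ampersand is handled inside htmlspecialchars(), the second step cannot double-escape anything: none of the references it produces contain a slash.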


+2

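No, it isn't. Why reinvent the wheel when you can use `htmlspecialchars()`? – 2013-04-20 21:17:47

+0

@MarcB: No, htmlspecialchars() doesn't follow the OWASP recommendation; PHP's function implements only a subset of it. – 2013-04-20 21:29:59

+0

In what context will the escaped value be used? – Gumbo 2013-04-20 21:46:03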

Answers

2

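Inside an element's content, the only character that can be harmful is the start-tag delimiter <, because it may denote the start of some markup declaration, whether a start tag, an end tag, or a comment. So that character should always be escaped.

Other characters do not necessarily need to be escaped inside an element's content.

Quotes only need to be escaped inside tags, especially when they appear in attribute values that are wrapped in the same kind of quote or not quoted at all. Similarly, the markup declaration close delimiter > only needs to be escaped inside tags, and there only when it appears in an unquoted attribute value. However, escaping plain ampersands as well is recommended to avoid them being interpreted as the start of a character reference by mistake.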
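To make the distinction between contexts concrete, here is a small sketch using PHP's own htmlspecialchars() flags. The sample string and variable names are made up for illustration.

```php
<?php
$input = 'Tom & "Jerry" <b>';

// Element content: escaping &, < and > is enough, quotes may stay as they are.
$forContent = htmlspecialchars($input, ENT_NOQUOTES, 'UTF-8');

// Quoted attribute value: the quote characters must be escaped as well.
$forAttribute = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo "<p>$forContent</p>\n";
// <p>Tom &amp; "Jerry" &lt;b&gt;</p>
echo "<input value=\"$forAttribute\">\n";
// <input value="Tom &amp; &quot;Jerry&quot; &lt;b&gt;">
```

In practice, using ENT_QUOTES everywhere is the simpler habit, since over-escaping element content does no harm.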

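Now, as for the reason to replace / as well: this could be due to a feature of SGML, the markup language HTML was adapted from, which allows a so-called null end-tag:

To see how null end-tags work in practice, consider their use with an element that can be defined as: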

<!ELEMENT ISBN - - CDATA --ISBN number-- > 

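Rather than entering an ISBN number as

<ISBN>0 201 17535 5</ISBN>

we can use the null end-tag option to enter the element in the shortened form: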

<ISBN/0 201 17535 5/ 

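However, I have never seen this feature actually implemented by any browser. HTML's syntax rules have always been stricter than SGML's.

Another, more probable reason is the content model of the so-called raw text elements (script and style), which is plain text with the following restriction:

The text in raw text and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F).

This says that inside a raw text element such as script, an occurrence of </script/ would denote the end tag: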

<script> 
alert(0</script/.exec("script").index) 
</script> 

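Although this is perfectly valid JavaScript code, the end tag would be denoted by </script/. Other than that, / does no harm. And if you allow arbitrary input in a JavaScript context with only HTML escaping applied, you are already doomed.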
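As an illustration of why the script context needs its own encoding, the sketch below checks a string against the restriction quoted above and then encodes it with json_encode() and its JSON_HEX_* flags, which are standard PHP. The helper name contains_raw_text_end_tag is hypothetical, and this is only one possible approach, not a complete defense.

```php
<?php
// Hypothetical helper: does $text contain a sequence that would close a
// raw text element named $tagName, per the restriction quoted above?
function contains_raw_text_end_tag(string $text, string $tagName): bool
{
    $pattern = '~</' . preg_quote($tagName, '~') . '[\t\n\f\r />]~i';
    return preg_match($pattern, $text) === 1;
}

$untrusted = '</script/><script>alert(1)</script>';

// JSON_HEX_TAG turns < and > into \u003C and \u003E, so the encoded
// string literal can no longer contain an end-tag sequence.
$forScript = json_encode(
    $untrusted,
    JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
);

var_dump(contains_raw_text_end_tag($untrusted, 'script')); // bool(true)
var_dump(contains_raw_text_end_tag($forScript, 'script')); // bool(false)

echo "<script>var value = $forScript;</script>\n";
```

The JavaScript engine decodes the \u003C escapes back to the original characters, so the value itself is unchanged; only its representation in the page source differs.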

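By the way, it does not matter which kind of character reference is used to escape these characters, whether a named character reference (i.e. an entity reference) or a numeric character reference (in decimal or hexadecimal notation). They all refer to the same characters.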
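A quick way to convince yourself of this, assuming PHP's html_entity_decode() as a stand-in for what a browser does when it parses the references (decode_reference is a hypothetical wrapper name):

```php
<?php
// Hypothetical wrapper around PHP's html_entity_decode().
function decode_reference(string $reference): string
{
    // ENT_HTML5 is needed for &apos;, which HTML 4.01 does not define.
    return html_entity_decode($reference, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Named, decimal and hexadecimal references for the same characters.
$lessThan = array_map('decode_reference', ['&lt;', '&#60;', '&#x3C;']);
$apostrophe = array_map('decode_reference', ['&apos;', '&#039;', '&#x27;']);

var_dump(array_unique($lessThan));   // a single element: "<"
var_dump(array_unique($apostrophe)); // a single element: "'"
```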

2

You should use htmlspecialchars:

$str = htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); 

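See the documentation. It basically does what your function does, but it is already implemented and cleaner. However, it does not convert slashes or backslashes.

If you want to convert every character that has a named HTML entity, you can use htmlentities: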

$str = htmlentities($str, ENT_QUOTES, 'UTF-8'); 

It is documented here. If all you want is to prevent XSS attacks and JS injection, I would recommend the former, because it has much lower overhead.
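To show the practical difference between the two functions, here is a short comparison. The sample string is made up, and the \u{00E9} escape (PHP 7 or later) is used only so the example does not depend on the file's encoding.

```php
<?php
$input = "caf\u{00E9} <b>"; // "café <b>" in UTF-8

// Only the HTML-special characters are converted.
$special = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Every character with a named entity is converted, including é.
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // café &lt;b&gt;
echo $entities, "\n"; // caf&eacute; &lt;b&gt;
```

Both results are equally safe against markup injection; the extra conversions in htmlentities() only matter if the page encoding cannot represent the original characters.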

+1

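Even if it is generally true that you should use an existing function when one exists, PHP's *htmlspecialchars()* does only a subset of what OWASP recommends here; it does not convert '/', for example. – 2013-04-20 21:31:23

+1

That's true, but without `'"<>` I don't think `/` is dangerous. If you want to be thorough, `htmlentities()` is better than either. – 2013-04-20 21:33:04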

0

You can use the stripslashes() function.

$str = stripslashes($str);

0

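This is a long one, but I feel I would be doing some harm if I didn't share it. All of the code was taken directly from various parts of the source of the latest stable Drupal release and compiled into one place (shown below). It is a very effective way to prevent XSS attacks.

Example: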

$html = file_get_contents('http://example.com'); 
$output = filter_xss($html); 
print $output; 

Or:

$html = file_get_contents('http://example.com'); 
// Allow only <ul></ul>, <li></li>, and <p></p> tags. 
$allowed_tags = array('ul', 'li', 'p'); 
$output = filter_xss($html, $allowed_tags); 
print $output; 

Here is the code needed to run the examples above:

/** 
* Filters HTML to prevent cross-site-scripting (XSS) vulnerabilities. 
* 
* Based on kses by Ulf Harnhammar, see http://sourceforge.net/projects/kses. 
* For examples of various XSS attacks, see: http://ha.ckers.org/xss.html. 
* 
* This code does four things: 
* - Removes characters and constructs that can trick browsers. 
* - Makes sure all HTML entities are well-formed. 
* - Makes sure all HTML tags and attributes are well-formed. 
* - Makes sure no HTML tags contain URLs with a disallowed protocol (e.g. 
* javascript:). 
* 
* @param $string 
* The string with raw HTML in it. It will be stripped of everything that can 
* cause an XSS attack. 
* @param $allowed_tags 
* An array of allowed tags. 
* 
* @return 
* An XSS safe version of $string, or an empty string if $string is not 
* valid UTF-8. 
* 
* @see validate_utf8() 
* @ingroup sanitization 
*/ 
function filter_xss($string, $allowed_tags = array('a', 'em', 'strong', 'cite', 'blockquote', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd')) { 
    // Only operate on valid UTF-8 strings. This is necessary to prevent cross 
    // site scripting issues on Internet Explorer 6. 
    if (!validate_utf8($string)) {
        return '';
    }
    // Store the text format. 
    _filter_xss_split($allowed_tags, TRUE); 
    // Remove NULL characters (ignored by some browsers). 
    $string = str_replace(chr(0), '', $string); 
    // Remove Netscape 4 JS entities. 
    $string = preg_replace('%&\s*\{[^}]*(\}\s*;?|$)%', '', $string); 

    // Defuse all HTML entities. 
    $string = str_replace('&', '&amp;', $string); 
    // Change back only well-formed entities in our whitelist: 
    // Decimal numeric entities. 
    $string = preg_replace('/&amp;#([0-9]+;)/', '&#\1', $string); 
    // Hexadecimal numeric entities. 
    $string = preg_replace('/&amp;#[Xx]0*((?:[0-9A-Fa-f]{2})+;)/', '&#x\1', $string); 
    // Named entities. 
    $string = preg_replace('/&amp;([A-Za-z][A-Za-z0-9]*;)/', '&\1', $string); 

    return preg_replace_callback('%
        (
        <(?=[^a-zA-Z!/])  # a lone <
        |                 # or
        <!--.*?-->        # a comment
        |                 # or
        <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string
        |                 # or
        >                 # just a >
        )%x', '_filter_xss_split', $string);
} 

/** 
* Processes an HTML tag. 
* 
* @param $m 
* An array with various meaning depending on the value of $store. 
* If $store is TRUE then the array contains the allowed tags. 
* If $store is FALSE then the array has one element, the HTML tag to process. 
* @param $store 
* Whether to store $m. 
* 
* @return 
* If the element isn't allowed, an empty string. Otherwise, the cleaned up 
* version of the HTML element. 
*/ 
function _filter_xss_split($m, $store = FALSE) {
    static $allowed_html;

    if ($store) {
        $allowed_html = array_flip($m);
        return;
    }

    $string = $m[1];

    if (substr($string, 0, 1) != '<') {
        // We matched a lone ">" character.
        return '&gt;';
    }
    elseif (strlen($string) == 1) {
        // We matched a lone "<" character.
        return '&lt;';
    }

    if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9]+)([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {
        // Seriously malformed.
        return '';
    }

    $slash = trim($matches[1]);
    $elem = &$matches[2];
    $attrlist = &$matches[3];
    $comment = &$matches[4];

    if ($comment) {
        $elem = '!--';
    }

    if (!isset($allowed_html[strtolower($elem)])) {
        // Disallowed HTML element.
        return '';
    }

    if ($comment) {
        return $comment;
    }

    if ($slash != '') {
        return "</$elem>";
    }

    // Is there a closing XHTML slash at the end of the attributes?
    $attrlist = preg_replace('%(\s?)/\s*$%', '\1', $attrlist, -1, $count);
    $xhtml_slash = $count ? ' /' : '';

    // Clean up attributes.
    $attr2 = implode(' ', _filter_xss_attributes($attrlist));
    $attr2 = preg_replace('/[<>]/', '', $attr2);
    $attr2 = strlen($attr2) ? ' ' . $attr2 : '';

    return "<$elem$attr2$xhtml_slash>";
}

/** 
* Processes a string of HTML attributes. 
* 
* @return 
* Cleaned up version of the HTML attributes. 
*/ 
function _filter_xss_attributes($attr) {
    $attrarr = array();
    $mode = 0;
    $attrname = '';

    while (strlen($attr) != 0) {
        // Was the last operation successful?
        $working = 0;

        switch ($mode) {
            case 0:
                // Attribute name, href for instance.
                if (preg_match('/^([-a-zA-Z]+)/', $attr, $match)) {
                    $attrname = strtolower($match[1]);
                    $skip = ($attrname == 'style' || substr($attrname, 0, 2) == 'on');
                    $working = $mode = 1;
                    $attr = preg_replace('/^[-a-zA-Z]+/', '', $attr);
                }
                break;

            case 1:
                // Equals sign or valueless ("selected").
                if (preg_match('/^\s*=\s*/', $attr)) {
                    $working = 1; $mode = 2;
                    $attr = preg_replace('/^\s*=\s*/', '', $attr);
                    break;
                }

                if (preg_match('/^\s+/', $attr)) {
                    $working = 1; $mode = 0;
                    if (!$skip) {
                        $attrarr[] = $attrname;
                    }
                    $attr = preg_replace('/^\s+/', '', $attr);
                }
                break;

            case 2:
                // Attribute value, a URL after href= for instance.
                if (preg_match('/^"([^"]*)"(\s+|$)/', $attr, $match)) {
                    $thisval = filter_xss_bad_protocol($match[1]);

                    if (!$skip) {
                        $attrarr[] = "$attrname=\"$thisval\"";
                    }
                    $working = 1;
                    $mode = 0;
                    $attr = preg_replace('/^"[^"]*"(\s+|$)/', '', $attr);
                    break;
                }

                if (preg_match("/^'([^']*)'(\s+|$)/", $attr, $match)) {
                    $thisval = filter_xss_bad_protocol($match[1]);

                    if (!$skip) {
                        $attrarr[] = "$attrname='$thisval'";
                    }
                    $working = 1; $mode = 0;
                    $attr = preg_replace("/^'[^']*'(\s+|$)/", '', $attr);
                    break;
                }

                if (preg_match("%^([^\s\"']+)(\s+|$)%", $attr, $match)) {
                    $thisval = filter_xss_bad_protocol($match[1]);

                    if (!$skip) {
                        $attrarr[] = "$attrname=\"$thisval\"";
                    }
                    $working = 1; $mode = 0;
                    $attr = preg_replace("%^[^\s\"']+(\s+|$)%", '', $attr);
                }
                break;
        }

        if ($working == 0) {
            // Not well formed; remove and try again.
            $attr = preg_replace('/
                ^
                (
                "[^"]*("|$)     # - a string that starts with a double quote, up until the next double quote or the end of the string
                |               # or
                \'[^\']*(\'|$)| # - a string that starts with a quote, up until the next quote or the end of the string
                |               # or
                \S              # - a non-whitespace character
                )*              # any number of the above three
                \s*             # any number of whitespaces
                /x', '', $attr);
            $mode = 0;
        }
    }

    // The attribute list ends with a valueless attribute like "selected".
    if ($mode == 1 && !$skip) {
        $attrarr[] = $attrname;
    }
    return $attrarr;
}

/** 
* Processes an HTML attribute value and strips dangerous protocols from URLs. 
* 
* @param $string 
* The string with the attribute value. 
* @param $decode 
* (deprecated) Whether to decode entities in the $string. Set to FALSE if the 
* $string is in plain text, TRUE otherwise. Defaults to TRUE. 
* 
* @return 
* Cleaned up and HTML-escaped version of $string. 
*/ 
function filter_xss_bad_protocol($string, $decode = TRUE) {
    // Get the plain text representation of the attribute value (i.e. its meaning).
    if ($decode) {
        $string = decode_entities($string);
    }
    return check_plain(strip_dangerous_protocols($string));
}

/** 
* Strips dangerous protocols (e.g. 'javascript:') from a URI. 
* 
* @param $uri 
* A plain-text URI that might contain dangerous protocols. 
* 
* @return 
* A plain-text URI stripped of dangerous protocols. As with all plain-text 
* strings, this return value must not be output to an HTML page without 
* check_plain() being called on it. However, it can be passed to functions 
* expecting plain-text strings. 
* 
*/ 
function strip_dangerous_protocols($uri) {
    static $allowed_protocols;

    if (!isset($allowed_protocols)) {
        $allowed_protocols = array_flip(array('ftp', 'http', 'https', 'irc', 'mailto', 'news', 'nntp', 'rtsp', 'sftp', 'ssh', 'tel', 'telnet', 'webcal'));
    }

    // Iteratively remove any invalid protocol found.
    do {
        $before = $uri;
        $colonpos = strpos($uri, ':');
        if ($colonpos > 0) {
            // We found a colon, possibly a protocol. Verify.
            $protocol = substr($uri, 0, $colonpos);
            // If a colon is preceded by a slash, question mark or hash, it cannot
            // possibly be part of the URL scheme. This must be a relative URL, which
            // inherits the (safe) protocol of the base document.
            if (preg_match('![/?#]!', $protocol)) {
                break;
            }
            // Check if this is a disallowed protocol. Per RFC2616, section 3.2.3
            // (URI Comparison) scheme comparison must be case-insensitive.
            if (!isset($allowed_protocols[strtolower($protocol)])) {
                $uri = substr($uri, $colonpos + 1);
            }
        }
    } while ($before != $uri);

    return $uri;
}

/** 
* Encodes special characters in a plain-text string for display as HTML. 
* 
* Also validates strings as UTF-8 to prevent cross site scripting attacks on 
* Internet Explorer 6. 
* 
* @param $text 
* The text to be checked or processed. 
* 
* @return 
* An HTML safe version of $text, or an empty string if $text is not 
* valid UTF-8. 
* 
* @see validate_utf8() 
* @ingroup sanitization 
*/ 
function check_plain($text) { 
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8'); 
} 

/** 
* Decodes all HTML entities (including numerical ones) to regular UTF-8 bytes. 
* 
* Double-escaped entities will only be decoded once ("&amp;lt;" becomes "&lt;" 
* , not "<"). Be careful when using this function, as decode_entities can 
* revert previous sanitization efforts (&lt;script&gt; will become <script>). 
* 
* @param $text 
* The text to decode entities in. 
* 
* @return 
* The input $text, with all HTML entities decoded once. 
*/ 
function decode_entities($text) { 
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8'); 
} 

/** 
* Checks whether a string is valid UTF-8. 
* 
* All functions designed to filter input should use validate_utf8 
* to ensure they operate on valid UTF-8 strings to prevent bypass of the 
* filter. 
* 
* When text containing an invalid UTF-8 lead byte (0xC0 - 0xFF) is presented 
* as UTF-8 to Internet Explorer 6, the program may misinterpret subsequent 
* bytes. When these subsequent bytes are HTML control characters such as 
* quotes or angle brackets, parts of the text that were deemed safe by filters 
* end up in locations that are potentially unsafe; An onerror attribute that 
* is outside of a tag, and thus deemed safe by a filter, can be interpreted 
* by the browser as if it were inside the tag. 
* 
* The function does not return FALSE for strings containing character codes 
* above U+10FFFF, even though these are prohibited by RFC 3629. 
* 
* @param $text 
* The text to check. 
* 
* @return 
* TRUE if the text is valid UTF-8, FALSE if not. 
*/ 
function validate_utf8($text) {
    if (strlen($text) == 0) {
        return TRUE;
    }
    // With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings
    // containing invalid UTF-8 byte sequences. It does not reject character
    // codes above U+10FFFF (represented by 4 or more octets), though.
    return (preg_match('/^./us', $text) == 1);
}

0

For Perl scripts or CGI, you can use HTML::Entities:

use HTML::Entities; 

$str = encode_entities($str, '<>&"');