Unicode parsing with preg_match

I want to match a subset of Unicode/UTF-8 characters (the ones highlighted in yellow here: http://solomon.ie/unicode/). From my research, I came up with this:

// ensure it's valid unicode/get rid of invalid UTF8 chars 
$text = iconv("UTF-8","UTF-8//IGNORE",$text); 

// and just allow a basic english...ish.. chars through - no controls, chinese etc 
$match_list = "\x{09}\x{0a}\x{0d}\x{20}-\x{7e}"; // basic ascii chars plus CR,LF and TAB 
$match_list .= "\x{a1}-\x{ff}"; // extended latin 1 chars excluding control chars 
$match_list .= "\x{20ac}"; // euro symbol 

if (preg_match("/[^$match_list]/u", $text)) 
    $error_text_array[] = "<b>INVALID UNICODE characters</b>";

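Testing seems to show that it works as expected, but as a Unicode newbie I would be grateful if anyone could spot any holes I have overlooked.

Can I confirm that the hex ranges match Unicode code points rather than the actual hex byte values (i.e. x20ac, not xe282ac, is correct for the euro symbol)?

And can I mix literal characters and hex values, as in preg_match("/[^0-9\x{20ac}]/u", $text)?

Thanks, Kevin

Note: I asked this question before, but it was closed as "better suited to codereview.stackexchange.com". It got no response there, so I hope it is OK to try again here in a more concise format.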

2012-04-28 KevInSol

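I created a wrapper to test your code, and I think it is safe for filtering the characters you need. However, your code raises an E_NOTICE when iconv finds invalid UTF-8 characters, so I think you should add @ to the iconv line to suppress the notice.

As for the second question, it is fine to mix literal characters and hex values. You can also try it yourself. :)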
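Both points can be checked with a few lines. This is only a minimal sketch: hasDisallowed() is a made-up helper name, and the euro sign is written as raw bytes so the result does not depend on how the source file is saved. It suggests that, with the /u modifier, \x{20ac} refers to the code point U+20AC rather than to the bytes E2 82 AC, and that literals and \x{...} escapes can share one character class.

```php
<?php
// U+20AC (euro sign) written as its three UTF-8 bytes.
$euro = "\xE2\x82\xAC";

// With /u, \x{20ac} is a code point, so it matches the 3-byte sequence as one character.
$matchesCodePoint = preg_match('/^\x{20ac}$/u', $euro) === 1;

// Hypothetical helper for this sketch: true if $text contains anything
// other than the digits 0-9 or the euro sign.
function hasDisallowed(string $text): bool
{
    // A literal range (0-9) and a \x{...} escape mixed in one class.
    return preg_match('/[^0-9\x{20ac}]/u', $text) === 1;
}

var_dump(strlen($euro));               // int(3): byte length, not character count
var_dump($matchesCodePoint);           // bool(true)
var_dump(hasDisallowed("12" . $euro)); // bool(false)
var_dump(hasDisallowed("12a"));        // bool(true)
```

The single-quoted pattern is a deliberate choice here: it passes \x{20ac} to the regex engine untouched, with no need to think about PHP's own string escapes.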

<?php 
function generatechar($char) 
{ 
    $char = str_pad(dechex($char), 4, '0', STR_PAD_LEFT); 
    $unicodeChar = '\u'.$char; 
    return json_decode('"'.$unicodeChar.'"'); 
} 
function test($text) 
{ 
    // ensure it's valid unicode/get rid of invalid UTF8 chars 
    $text = @iconv("UTF-8","UTF-8//IGNORE",$text); // @ suppresses the E_NOTICE raised on invalid UTF-8 
    // and just allow a basic english...ish.. chars through - no controls, chinese etc 
    $match_list = "\x{09}\x{0a}\x{0d}\x{20}-\x{7e}"; // basic ascii chars plus CR,LF and TAB 
    $match_list .= "\x{a1}-\x{ff}"; // extended latin 1 chars excluding control chars 
    $match_list .= "\x{20ac}"; // euro symbol 

    if (preg_match("/[^$match_list]+/u", $text)) 
        return false; 

    if (strlen($text) == 0) 
        return false; // For testing purposes only! 
    return true; 
} 

for($n=0;$n<65536;$n++) 
{ 
    $c = generatechar($n); 
    if(test($c)) 
     echo $n.':'.$c."\n"; 
}

2012-04-28 17:13:37 chalet16

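chalet16 - many thanks. I'll play with your test code when I'm back in the office. I had already mixed characters as in my example and it seemed to work OK, but I was just checking. Thanks for your efforts, Kevin – KevInSol 2012-04-28 17:23:57

Hi :) Many thanks again. I've tried this now and it works mostly as expected, except that I seem to get U+D800 to U+DFFF back. I can't see where I'm going wrong. Also, I noticed you added the + metacharacter to my regex. Is that needed when matching any char not in the list? – KevInSol 2012-04-30 11:40:48

I've just seen that U+D800 to U+DFFF are surrogate pairs, but these seem to be used in UTF-16, not UTF-8? Should iconv strip them? – KevInSol 2012-04-30 11:54:04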
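On these two follow-up points, here is a rough sketch rather than a definitive answer. As far as I know, U+D800 to U+DFFF are reserved for UTF-16 surrogates and have no valid UTF-8 encoding, so bytes that try to encode one should be rejected when preg_match runs with /u. The + should make no difference to a yes/no check, because a negated class already matches at the first offending character. isValidUtf8() is a made-up helper name, and the character class is shortened to printable ASCII for the demonstration.

```php
<?php
// Hypothetical helper: an empty pattern with /u makes preg_match() validate
// the subject; it returns false (not 0) when the bytes are not valid UTF-8.
function isValidUtf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// The bytes U+D800 would have if it were encoded like an ordinary code point.
$surrogateBytes = "\xED\xA0\x80";
$surrogateIsValid = isValidUtf8($surrogateBytes);
var_dump($surrogateIsValid);    // bool(false)
var_dump(isValidUtf8("abc"));   // bool(true)

// Compare the negated class with and without the + quantifier.
$class = '\x{20}-\x{7e}'; // printable ASCII only, for this sketch
$sameResult = true;
foreach (["abc", "ab\x01c", "\x01\x02"] as $sample) {
    $without = preg_match("/[^$class]/u", $sample);
    $with    = preg_match("/[^$class]+/u", $sample);
    if ($without !== $with) {
        $sameResult = false;
    }
}
var_dump($sameResult);          // bool(true)
```

If the validity check fails, preg_match returns false rather than 0, so a plain if (preg_match(...)) treats malformed input the same as "no disallowed characters found". Comparing the result with === is the safer habit.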
