Removing PHP tags from a string with Python

I want to remove PHP tags from a string with Python:

content = re.sub('<\?php(.*)\?>', '', content)

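This seems to work fine for PHP tags that open and close on a single line, but when a PHP tag is closed several lines later, it is not caught. Can anyone help?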

2012-04-23 wtayyeb

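I think this is beyond what a regex can do; you need an actual parser. Consider: '<?php echo''?>', '<?php if (1): ?> PHP ', '<? echo'shorttags!'?>', etc. – 2012-04-23 21:20:45

@FrancisAvila Just removing them will do the job for me! – wtayyeb 2012-04-23 21:39:37

No, it won't. You think it will, but it won't. Try your regex against these test cases. Also remember that in PHP you can omit the final '?>'. – 2012-04-23 21:44:35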

If you only want to handle the simple cases, a simple regex will work fine. The *? operator in Python regular expressions gives a minimal (non-greedy) match.

import re

# .*? stops at the first ?>, and re.DOTALL lets . match newlines.
_PHP_TAG = re.compile(r'<\?php.*?\?>', re.DOTALL)

def strip_php(content):
    return _PHP_TAG.sub('', content)

INPUT = """
Simple: <?php echo $a ?>.
Two on one line: <?php echo $a ?> (keep this) <?php echo $b ?>.
Multiline: <?php
    if ($a) {
        echo $b;
    }
?>.
"""

print(strip_php(INPUT))

Output:


Simple: .
Two on one line:  (keep this) .
Multiline: .
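
To make the effect of the two ingredients visible, here is a minimal sketch that compares a greedy pattern, the non-greedy pattern, and the non-greedy pattern without re.DOTALL on one input. The variable names and the sample string are made up for illustration.

```python
import re

# One single-line tag followed by one multi-line tag.
text = "a <?php one ?> b <?php\ntwo\n?> c"

greedy = re.compile(r"<\?php.*\?>", re.DOTALL)    # .* runs to the LAST ?>
lazy = re.compile(r"<\?php.*?\?>", re.DOTALL)     # .*? stops at the FIRST ?>
no_dotall = re.compile(r"<\?php.*?\?>")           # . cannot cross a newline

print(repr(greedy.sub("", text)))     # 'a  c'
print(repr(lazy.sub("", text)))       # 'a  b  c'
print(repr(no_dotall.sub("", text)))  # 'a  b <?php\ntwo\n?> c'
```

The greedy version swallows the text between the two tags, and the version without re.DOTALL misses the multi-line tag, which is the symptom described in the question.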

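I hope you are not using this to sanitize input, because it is not good enough for that purpose. (It is a blacklist rather than a whitelist, and a blacklist is never enough.)

If you want to handle complex cases such as: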

<?php echo '?>' ?>

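you can still do it with a regex, but you may want to reconsider which tool you are using, because the regex can become too complicated to read. The following regex handles all of Francis Avila's test cases: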

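import re

# Double- and single-quoted PHP string literals, allowing backslash escapes.
dstr = r'"(?:[^"\\]|\\.)*"'
sstr = r"'(?:[^'\\]|\\.)*'"
# A tag runs from <? to the first ?> outside a string literal, or to end of input.
_PHP_TAG = re.compile(
    r'''<\?[^"']*?(?:(?:%s|%s)[^"']*?)*(?:\?>|$)''' % (dstr, sstr)
)

def strip_php(content):
    return _PHP_TAG.sub('', content)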

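Regular expressions are almost powerful enough to solve this problem. We know this because PHP itself uses regular expressions to tokenize PHP source code. You can read the regular expressions PHP uses in Zend/zend_language_scanner.l. That file is written for Lex, a common tool that generates tokenizers from regular expressions.

The reason I say "almost" is that we are actually using extended regular expressions.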
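
As a rough check of what the quote-aware pattern buys over the simple one, the sketch below runs both on the '?>'-inside-a-string case. It also shows one input the pattern does not handle: an apostrophe inside a PHP comment, which the pattern mistakes for the start of a string literal. The names NAIVE, PHP_TAG, sample and tricky are illustrative, not part of the answer.

```python
import re

DSTR = r'"(?:[^"\\]|\\.)*"'
SSTR = r"'(?:[^'\\]|\\.)*'"
PHP_TAG = re.compile(
    r'''<\?[^"']*?(?:(?:%s|%s)[^"']*?)*(?:\?>|$)''' % (DSTR, SSTR)
)
NAIVE = re.compile(r'<\?php.*?\?>', re.DOTALL)

sample = "before <?php echo '?>' ?> after"
print(repr(NAIVE.sub('', sample)))    # "before ' ?> after" (stops inside the string)
print(repr(PHP_TAG.sub('', sample)))  # 'before  after'

# Limitation: the apostrophe in the comment opens a "string" that never closes,
# so the pattern finds no match and the tag is left in place.
tricky = "<?php // don't ?>after"
print(repr(PHP_TAG.sub('', tricky)))  # unchanged
```

So the pattern covers string literals, but comments (and presumably heredocs) would each need further alternatives, which is where readability starts to suffer.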

2012-04-23 23:13:51

It works in my case, but without 're.DOTALL | re.MULTILINE'. – wtayyeb 2012-04-24 02:18:45

You're right, neither is needed. I forgot to take them out while I was playing with the regex. – 2012-04-24 03:58:54

-1

You can do it like this:

content = re.sub('\n','', content) 
content = re.sub('<\?php(.*)\?>', '', content)

Updated answer after the OP's comment:

content = re.sub('\n',' {NEWLINE} ', content) 
content = re.sub('<\?php(.*)\?>', '', content) 
content = re.sub(' {NEWLINE} ','\n', content)

Example in ipython:

In [81]: content 
Out[81]: ' 11111 <?php 222\n\n?> \n22222\nasd <?php asd\nasdasd\n?>\n3333\n' 

In [82]: content = re.sub('\n',' {NEWLINE} ', content) 
In [83]: content 
Out[83]: ' 11111 <?php 222 {NEWLINE}  {NEWLINE} ?>  {NEWLINE} 22222 {NEWLINE} asd <?php asd {NEWLINE} asdasd {NEWLINE} ?> {NEWLINE} 3333 {NEWLINE} '

In [84]: content = re.sub('<\?php(.*)\?>', '', content) 
In [85]: content 
Out[85]: ' 11111  {NEWLINE} 3333 {NEWLINE} '

In [88]: content = re.sub(' {NEWLINE} ','\n', content) 
In [89]: content 
Out[89]: ' 11111 \n3333\n'
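
Note that the session above also drops '22222' and 'asd', which sat between the two tags. As a sketch of why: the placeholder round trip appears to be equivalent to running the same greedy pattern with re.DOTALL, so the match still runs from the first '<?php' to the last '?>'. The code below reuses the session's input; str.replace stands in for the re.sub calls on the placeholder, and the variable names are illustrative.

```python
import re

content = ' 11111 <?php 222\n\n?> \n22222\nasd <?php asd\nasdasd\n?>\n3333\n'

# Placeholder round trip, as in the session above.
step = content.replace('\n', ' {NEWLINE} ')
step = re.sub(r'<\?php(.*)\?>', '', step)
roundtrip = step.replace(' {NEWLINE} ', '\n')

# Same greedy pattern, letting . match newlines directly.
dotall = re.sub(r'<\?php(.*)\?>', '', content, flags=re.DOTALL)

# Non-greedy variant: each tag is removed separately.
lazy = re.sub(r'<\?php(.*?)\?>', '', content, flags=re.DOTALL)

print(repr(roundtrip))  # ' 11111 \n3333\n'
print(repr(dotall))     # ' 11111 \n3333\n'
print(repr(lazy))       # ' 11111  \n22222\nasd \n3333\n'
```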

2012-04-23 21:12:07

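And how do I turn the unwanted changes back into \n? – wtayyeb 2012-04-23 21:18:03

You're right! You can replace the newlines with "something" that cannot possibly occur in the file, then run the regex to filter out the PHP tags, and finally replace the "something" with newlines again. – 2012-04-23 21:32:37

I'm sorry, but it removes the whole string. The returned string contains only a \n and no other characters; maybe the whole string is matched by the second line. – wtayyeb 2012-04-23 21:36:57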

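You can't solve this problem with regular expressions. Stripping PHP out of a string requires a real parser, one that understands at least a little PHP.

However, if you have PHP available, you can solve this problem quite easily. A PHP solution is given at the end of this answer.

Here is a demonstration of how many ways your regex approach can go wrong: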

import re

# Each entry maps a test name to (input, expected output after stripping).
testcases = {
    'easy': ("""show this<?php echo 'NOT THIS'?>""", 'show this'),
    'multiple tags': ("""<?php echo 'NOT THIS';?>show this, even though it's conditional<?php echo 'NOT THIS'?>""", "show this, even though it's conditional"),
    'omitted ?>': ("""show this <?php echo 'NOT THIS';""", 'show this '),
    'nested string': ("""show this <?php echo '<?php echo "NOT THIS" ?>'?> show this""", 'show this  show this'),
    'shorttags': ("""show this <? echo 'NOT THIS SHORTTAG!'?> show this""", 'show this  show this'),
    'echotags': ("""<?php $TEST = "NOT THIS"?>show this <?=$TEST?> show this""", 'show this  show this'),
}

testfailstr = """
FAILED: %s
IN:     %s
EXPECT: %s
GOT:    %s
"""

# The regex from the question, with DOTALL switched on via (?s).
removephp = re.compile(r'(?s)<\?php.*\?>')

for testname, (in_, expect) in testcases.items():
    got = removephp.sub('', in_)
    if expect != got:
        print(testfailstr % tuple(map(repr, (testname, in_, expect, got))))

Note that it is very difficult, if not impossible, to get a regex to pass all of the test cases.
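
For a side-by-side view, the sketch below runs three candidate patterns from this thread (greedy, non-greedy, and the quote-aware one from the other answer) over the same six cases and lists which cases each one fails. The helper failing_cases and the dictionary names are made up for this example. The expected strings keep the whitespace on both sides of a removed tag, hence the two spaces.

```python
import re

# Test cases restated from the answer: name -> (input, expected output).
CASES = {
    'easy': ("""show this<?php echo 'NOT THIS'?>""", 'show this'),
    'multiple tags': ("""<?php echo 'NOT THIS';?>show this, even though it's conditional<?php echo 'NOT THIS'?>""",
                      "show this, even though it's conditional"),
    'omitted ?>': ("""show this <?php echo 'NOT THIS';""", 'show this '),
    'nested string': ("""show this <?php echo '<?php echo "NOT THIS" ?>'?> show this""",
                      'show this  show this'),
    'shorttags': ("""show this <? echo 'NOT THIS SHORTTAG!'?> show this""",
                  'show this  show this'),
    'echotags': ("""<?php $TEST = "NOT THIS"?>show this <?=$TEST?> show this""",
                 'show this  show this'),
}

DSTR = r'"(?:[^"\\]|\\.)*"'
SSTR = r"'(?:[^'\\]|\\.)*'"
CANDIDATES = {
    'greedy': re.compile(r'(?s)<\?php.*\?>'),
    'lazy': re.compile(r'(?s)<\?php.*?\?>'),
    'quote-aware': re.compile(
        r'''<\?[^"']*?(?:(?:%s|%s)[^"']*?)*(?:\?>|$)''' % (DSTR, SSTR)),
}

def failing_cases(pattern):
    """Return the sorted names of the cases the pattern gets wrong."""
    return sorted(name for name, (src, want) in CASES.items()
                  if pattern.sub('', src) != want)

for label, pattern in CANDIDATES.items():
    print(label, failing_cases(pattern))
```

Passing these six cases does not make a pattern complete; they contain no comments or heredocs, for instance.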

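If you have PHP available, you can use PHP's tokenizer to strip out the PHP. The following code should strip all PHP code from a string without fail, and should cover all the weird corner cases.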

<?php
// token_get_all() returns one-character tokens as plain strings;
// they only occur inside PHP code, so tag them with a marker id.
define('T_ONECHAR_TOKEN', 'T_ONECHAR_TOKEN');

function strip_php($input) {
    $tokens = token_get_all($input);

    $output = '';
    $inphp = false;
    foreach ($tokens as $token) {
        if (is_string($token)) {
            $token = array(T_ONECHAR_TOKEN, $token);
        }
        list($id, $str) = $token;
        if (!$inphp) {
            if ($id === T_OPEN_TAG || $id === T_OPEN_TAG_WITH_ECHO) {
                $inphp = true;
            } else {
                // Outside PHP: keep the text.
                $output .= $str;
            }
        } else {
            if ($id === T_CLOSE_TAG) {
                $inphp = false;
            }
        }
    }

    return $output;
}

$test = 'a <?php //NOT THIS?>show this<?php //NOT THIS';

echo strip_php($test);
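
If PHP is not available from the Python side, the same idea (track whether you are inside a tag, and skip over strings and comments while looking for the closing tag) can be approximated with a small hand-written scanner. This is only a sketch under stated assumptions, not a port of PHP's tokenizer: the name strip_php_scan is made up, heredoc/nowdoc strings are not handled, any '<?' is treated as an opening tag, and the newline that PHP swallows after '?>' is kept.

```python
def _skip_php(text, i):
    """Return the index just past the closing ?>, or len(text) if it is omitted."""
    n = len(text)
    while i < n:
        ch = text[i]
        if text.startswith("?>", i):
            return i + 2
        if ch in "'\"":
            # Quoted string: skip to the matching quote, honouring backslash escapes.
            i += 1
            while i < n and text[i] != ch:
                i += 2 if text[i] == "\\" else 1
            i += 1
        elif text.startswith("/*", i):
            # Block comment: a ?> inside it does not close the tag.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif text.startswith("//", i) or ch == "#":
            # Line comment: ends at a newline or at ?>.
            while i < n and text[i] != "\n" and not text.startswith("?>", i):
                i += 1
        else:
            i += 1
    return n


def strip_php_scan(text):
    """Remove everything between <? and the matching ?> (or end of input)."""
    out = []
    i, n = 0, len(text)
    while i < n:
        start = text.find("<?", i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        i = _skip_php(text, start + 2)
    return "".join(out)


print(repr(strip_php_scan("a <?php //NOT THIS?>show this<?php //NOT THIS")))
```

For the test string used in the PHP example this prints 'a show this'.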

2012-04-23 22:54:44

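I appreciate getting an actual answer alongside the explanation of why a regex is insufficient. Nice. – 2012-04-23 23:23:51

I posted a regex that handles all of the test cases. I suspect that, like most tokenizers, the PHP tokenizer is built on regular expressions in the first place. By using the PHP tokenizer you save yourself the work of writing the regex, but you are in fact still using regular expressions. – 2012-04-23 23:26:36

If you're curious, you can read the regular expressions PHP uses in the PHP source file "Zend/zend_language_scanner.l". – 2012-04-23 23:33:42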
