Regular expression: strip HTML attributes except SRC

I'm trying to write a regular expression that strips all tag attributes except the SRC attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

would return:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression that strips all attributes, but I'm trying to tweak it so that it leaves SRC in. Here is what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

I'm using PHP's preg_replace() for this.

Thanks! Ian

Source

2010-06-08 Ian McIntyre Silber

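You can parse HTML with regular expressions. Not all HTML, but if you know exactly what you are receiving, you can use a regex. This is a religious war, waged by people who assume that every situation has unlimited stack and memory. – 2010-06-08 08:32:32

OK, here is what I used, and it seems to work well: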

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.

Source

2010-06-08 21:32:41

You usually should not parse HTML using regular expressions.

Instead, call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
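For what it's worth, here is a minimal sketch of that DOM approach. It is only an illustration under a few assumptions: the function name stripAttributesExceptSrc is made up, the input is a body fragment in an ASCII-compatible encoding, and it visits the elements with getElementsByTagName('*') rather than an explicit recursion.

```php
<?php
// Sketch only: stripAttributesExceptSrc is a hypothetical helper name.
function stripAttributesExceptSrc(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // '*' matches every element in the document.
    foreach ($doc->getElementsByTagName('*') as $element) {
        // Copy the names first: removing attributes while iterating
        // the live attribute map can skip entries.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (strtolower($name) !== 'src') {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialise only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripAttributesExceptSrc(
    '<p id="paragraph" class="green">This is a paragraph with an image '
    . '<img src="/path/to/image.jpg" width="50" height="75"/></p>'
);
```

Because saveHTML() serialises as HTML rather than XHTML, the img tag should come back without the trailing slash. A '>' inside an attribute value is handled by the parser, so it does not trip this version up.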

Source

2010-06-08 02:34:55 SLaks

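Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. – fmark 2010-06-08 04:25:42

You can parse HTML with regular expressions. Not all HTML, but if you know exactly what you are receiving, you can use a regex. This is a religious war, waged by people who assume that every situation has unlimited stack and memory. – 2010-06-08 08:32:59

Some people have a terrible habit of not answering the question and reciting mantras instead. This should be downvoted, not upvoted by the religious right. – 2010-06-08 08:33:33

Unfortunately, I don't know how to answer this for PHP. If I were using Perl, I would do the following: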

use strict; 
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^; 

$data =~ s{ 
    <([^/> ]+)([^>]+)> # split into tagtype, attribs 
}{ 
    my $attribs = $2; 
    my @parts = split(/\s+/, $attribs); # separate by whitespace 
    @parts = grep { m/^src=/i } @parts; # retain just src tags 
    if (@parts) { 
     "<" . join(" ", $1, @parts) . ">"; 
    } else { 
     "<" . $1 . ">"; 
    } 
}xseg; 

print($data);

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
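Since the question asks about PHP, here is a rough port of the same idea using preg_replace_callback. Treat it as a sketch: the function name keepSrcOnly is made up for illustration, and it inherits the Perl version's assumption that attributes are separated by whitespace and contain none themselves.

```php
<?php
// Sketch only: keepSrcOnly is a hypothetical name, not part of the original answer.
function keepSrcOnly(string $html): string
{
    return preg_replace_callback(
        '~<([^/> ]+)([^>]+)>~',          // $m[1] = tag name, $m[2] = attribute text
        function (array $m): string {
            // Separate the attribute text on whitespace, dropping empty pieces.
            $parts = preg_split('/\s+/', $m[2], -1, PREG_SPLIT_NO_EMPTY);
            // Retain only the pieces that start with src=.
            $src = preg_grep('/^src=/i', $parts);
            // Rebuild the tag from its name plus whatever src pieces survived.
            return '<' . implode(' ', array_merge([$m[1]], $src)) . '>';
        },
        $html
    );
}

$data = '<p id="paragraph" class="green">This is a paragraph with an image '
      . '<img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo keepSrcOnly($data);
// <p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
```

Like the Perl version, this drops the self-closing slash and breaks on any attribute value that contains whitespace, so it is only suitable for input whose shape is known in advance.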

Source

2010-06-08 08:40:59

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>'; 

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text); 

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The regular expression broken down:

/    # Start Pattern 
<    # Match '<' at beginning of tags 
(   # Start Capture Group $1 - Tag Name 
    [a-z]   # Match 'a' through 'z' 
    [a-z0-9]*  # Match 'a' through 'z' or '0' through '9' zero or more times 
)    # End Capture Group 
(?:   # Start Non-Capture Group 
    [^>]*   # Match anything other than '>', Zero or More Times 
    (   # Start Capture Group $2 - ' src="...."' 
        \s   # Match one whitespace 
        src=   # Match 'src=' 
        ['"]   # Match ' or " 
        [^'"]*  # Match anything other than ' or " 
        ['"]   # Match ' or " 
    )    # End Capture Group 2 
)?   # End Non-Capture Group, match group zero or one time 
[^>]*?  # Match anything other than '>', Zero or More times, not-greedy (won't eat the /) 
(\/?)   # Capture Group $3 - '/' if it is there 
>    # Match '>' 
/i   # End Pattern - Case Insensitive

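Add some quoting and use the replacement text <$1$2$3>, and it should strip any non-src= properties from well-formed HTML tags.

Please note that this is not necessarily going to work on ALL input, as the anti-HTML + RegExp people are so cleverly noting below.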
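To make that caveat concrete, here is a small sketch that wraps the pattern above in a helper (the name keepSrcRegex is made up for illustration) and runs it against one well-formed tag and two inputs it does not handle.

```php
<?php
// Sketch only: keepSrcRegex is a hypothetical wrapper around the pattern above.
function keepSrcRegex(string $html): string
{
    $pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";
    return preg_replace($pattern, '<$1$2$3>', $html);
}

// Well-formed tag with a quoted src: works as intended.
echo keepSrcRegex('<img src="/path/to/image.jpg" width="50" height="75"/>'), "\n";
// <img src="/path/to/image.jpg"/>

// A '>' inside an attribute value ends the match early and leaves debris.
echo keepSrcRegex('<p style=">">text</p>'), "\n";
// <p>">text</p>

// An unquoted src value is not recognised, so it is stripped with the rest.
echo keepSrcRegex('<img src=a.jpg width="50">'), "\n";
// <img>
```

If the input can contain cases like the last two, a real HTML parser is the safer route.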

Source

2010-06-08 21:52:53 gnarf

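Unless a '>' appears in an attribute value. Parsing evil HTML is _hard_. Also, you forgot to escape the '\'. – SLaks 2010-06-08 22:09:38

Which '\' did I forget to escape? – gnarf 2010-06-08 22:16:43

+1 for a great explanation of the expression. – Anthony 2012-07-27 15:32:23

There are a few pitfalls, most notably that <p style=">"> would end up as <p>">, plus a few other broken cases... I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.

As noted above, you shouldn't use regular expressions to parse HTML or XML.

I would handle your example with str_replace(), provided the input is always the same.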

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>'; 

$str = str_replace('id="paragraph" class="green"', "", $str); 

$str = str_replace('width="50" height="75"',"",$str);

Source

2010-06-08 22:28:54 streetparade

Posting a solution for Oracle regular expressions:

<([^!][a-z][a-z0-9]*)([^>]*(\ssrc=[''''\"][^''''\"]*[''''\"]))?[^>]*?(\/?)>

Source

2015-06-17 04:37:09
