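Notice those responses about not using regular expressions? Why is that? It's because HTML represents structure. To be honest, this HTML overuses divs instead of semantic markup, but I'm going to parse it with the DOM functions anyway. So, here's the sample HTML I used: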
<html>
<body>
<!-- message -->
<div>
Just the text.
</div>
<!--/message -->
<!-- message -->
<div>
<div style="margin-left: 20px; margin-top:5px; ">
<div class="smallfont">Quote:</div>
</div>
<div style="margin-right: 20px; margin-left: 20px; padding: 10px;">
Message from <strong>Nickname</strong>
<div style="font-style:italic">Hello. It's a quote</div>
</div>
<br /><br />
It's the simple text
</div>
<!--/message -->
<!-- message -->
<div>
Text<br />
<div style="margin:20px; margin-top:5px; background-color: #30333D">
<div class="smallfont" style="margin-bottom:2px">PHP code:</div>
<div class="alt2" style="margin:0px; padding:6px; border:1px inset; width:640px; height:482px; overflow:auto; background-color:#FFFACA;">
<code style="white-space:nowrap">
<div dir="ltr" style="text-align:left">
<!-- php buffer start -->
<code>
LALALA PHP CODE
</code>
<!-- php buffer end -->
</div>
</code>
</div>
</div><br />
<br />
More text
</div>
<!--/message -->
</body>
</html>
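Before the parsing code, it may help to see how the DOM presents one message: as a list of child nodes that mixes elements and text. The sketch below is only illustrative. It assumes a trimmed copy of the second sample message inlined as a string, and `describeChildren()` is a hypothetical helper name, not part of the DOM extension. Whitespace-only text nodes are skipped here, although the parser may also produce them between elements.

```php
<?php
// Assumption: a shortened copy of the second sample message, inlined so the
// sketch runs without test.html.
$html = <<<HTML
<html><body>
<div>
<div><div class="smallfont">Quote:</div></div>
<div>Message from <strong>Nickname</strong></div>
<br />
It's the simple text
</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The message is the top-level div under body.
$message = $xpath->query('//body/div')->item(0);

// Hypothetical helper: list the direct children of a node, skipping
// text nodes that contain only whitespace.
function describeChildren(DOMNode $node): array {
    $out = array();
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $out[] = '<' . $child->nodeName . '>';
        } elseif ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $out[] = '#text: ' . $text;
            }
        }
    }
    return $out;
}

$children = describeChildren($message);
echo implode("\n", $children), "\n";
```

The element children are the pieces that need further inspection (quote or code), and the text children are the plain message text.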
Now the full code:
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');
// These just make the code nicer
// We could just inline them if we wanted to
// ----------- Helper Functions ------------
function HasQuote($part, $xpath) {
    // Check whether the div has a child div containing "Quote:"
    return $xpath->query("div[contains(.,'Quote:')]", $part)->length > 0;
}
function HasPHPCode($part, $xpath) {
    // Check whether the div has a child div containing "PHP code:"
    return $xpath->query("div[contains(.,'PHP code:')]", $part)->length > 0;
}
// ----------- End Helper Functions ------------
// ----------- Parse Functions ------------
function ParseQuote($quote, $xpath) {
    // The quote content is actually in the next div over, two siblings
    // along (past the text node between the divs). Man, this markup is weird.
    $quote = $quote->nextSibling->nextSibling;
    $quote_info = array('type' => 'quote');
    // The nickname is in the <strong> element
    $nickname = $xpath->query("strong", $quote);
    if($nickname->length) {
        $quote_info['nickname'] = $nickname->item(0)->nodeValue;
    }
    // The quoted text is in the child div
    $quote_text = $xpath->query("div", $quote);
    if($quote_text->length) {
        $quote_info['quote_text'] = trim($quote_text->item(0)->nodeValue);
    }
    return $quote_info;
}
function ParseCode($code, $xpath) {
    $code_info = array('type' => 'code');
    // This matches the path down to the innermost code element.
    // The leading "." keeps the search inside $code; a bare "//" would
    // search the whole document regardless of the context node.
    $code_text = $xpath->query(".//div/code/div/code", $code);
    if($code_text->length) {
        $code_info['code_text'] = trim($code_text->item(0)->nodeValue);
    }
    return $code_info;
}
// ----------- End Parser Functions ------------
function GetMessages($message, $xpath) {
    $message_contents = array();
    foreach($message->childNodes as $child) {
        // Inside a message, a div is either a Quote or PHP code; check which
        if(strtolower($child->nodeName) == 'div') {
            if(HasQuote($child, $xpath)) {
                $quote = ParseQuote($child, $xpath);
                // Skip quotes where no text was found
                if(!empty($quote['quote_text'])) {
                    $message_contents[] = $quote;
                }
            }
            else if(HasPHPCode($child, $xpath)) {
                $phpcode = ParseCode($child, $xpath);
                // Skip code blocks where no text was found
                if(!empty($phpcode['code_text'])) {
                    $message_contents[] = $phpcode;
                }
            }
        }
        // Otherwise check whether we've found some plain text
        else if ($child->nodeType == XML_TEXT_NODE) {
            // This might be just whitespace, so check that it's not empty
            $text = trim($child->nodeValue);
            if($text !== '') {
                $message_contents[] = array('type' => 'text', 'text' => $text);
            }
        }
    }
    return $message_contents;
}
$xpath = new DOMXPath($doc);
// We need to get the top-level divs, which
// are the messages
$toplevel_divs = $xpath->query("//body/div");
$messages = array();
foreach($toplevel_divs as $toplevel_div) {
    $messages[] = GetMessages($toplevel_div, $xpath);
}
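One thing to watch when passing a context node to `DOMXPath::query()`: an expression that begins with `//` is still evaluated from the document root, so a path meant to stay inside a single message should be relative (for example, start with `.//`). The sketch below uses a made-up two-div document (the `id` values `a` and `b` are assumptions for illustration) to show the difference.

```php
<?php
// Assumption: a tiny stand-in document with two sibling divs.
$html = '<html><body>'
      . '<div id="a"><code>first</code></div>'
      . '<div id="b"><code>second</code></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use the second div as the context node.
$second = $xpath->query('//div[@id="b"]')->item(0);

// Absolute path: the context node is ignored, both <code> elements match.
$absolute = $xpath->query('//code', $second);

// Relative path: only <code> elements inside $second match.
$relative = $xpath->query('.//code', $second);

echo $absolute->length, ' ', $absolute->item(0)->nodeValue, "\n"; // 2 first
echo $relative->length, ' ', $relative->item(0)->nodeValue, "\n"; // 1 second
```

With only one code block in the sample HTML the two forms return the same node, but with several code blocks the absolute form would always return the first one in the document.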
Now let's see what $messages looks like:
Array
(
[0] => Array
(
[0] => Array
(
[type] => text
[text] => Just the text.
)
)
[1] => Array
(
[0] => Array
(
[type] => quote
[nickname] => Nickname
[quote_text] => Hello. It's a quote
)
[1] => Array
(
[type] => text
[text] => It's the simple text
)
)
[2] => Array
(
[0] => Array
(
[type] => text
[text] => Text
)
[1] => Array
(
[type] => code
[code_text] => LALALA PHP CODE
)
[2] => Array
(
[type] => text
[text] => More text
)
)
)
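It's separated by message, and then further separated into the different kinds of content within each message! Now we can even use a basic print function like this: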
foreach($messages as $message) {
    echo "\n\n>>>>>> Message >>>>>>>\n";
    foreach($message as $content) {
        if($content['type'] == 'text') {
            echo "{$content['text']} ";
        }
        else if($content['type'] == 'quote') {
            echo "\n\n======== Quote =========\n";
            echo "From: {$content['nickname']}\n\n";
            echo "{$content['quote_text']}\n";
            echo "=====================\n\n";
        }
        else if($content['type'] == 'code') {
            echo "\n\n======== Code =========\n";
            echo "{$content['code_text']}\n";
            echo "=====================\n\n";
        }
    }
}
echo "\n";
And we get this!
>>>>>> Message >>>>>>>
Just the text.
>>>>>> Message >>>>>>>
======== Quote =========
From: Nickname
Hello. It's a quote
=====================
It's the simple text
>>>>>> Message >>>>>>>
Text
======== Code =========
LALALA PHP CODE
=====================
More text
This all works because the DOM parsing functions are able to understand structure.
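To make the contrast with regular expressions concrete, here is a small sketch on a nested div. The regex pattern is an illustrative, deliberately naive one (it is not taken from any of the other responses), and the one-line HTML string is an assumption made up for the example.

```php
<?php
// Assumption: a minimal document with one div nested inside another.
$html = '<html><body>'
      . '<div>outer start <div>inner</div> outer end</div>'
      . '</body></html>';

// Naive regex: stops at the first closing tag, so it cuts the outer div
// short and leaves a stray opening tag in the result.
preg_match('#<div>(.*?)</div>#s', $html, $m);
$regexResult = $m[1];

// DOM: the parser knows which closing tag belongs to which div.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$outer = $xpath->query('//body/div')->item(0);
$domResult = $outer->textContent;
$innerCount = $xpath->query('div', $outer)->length;

echo $regexResult, "\n"; // outer start <div>inner
echo $domResult, "\n";   // outer start inner outer end
echo $innerCount, "\n";  // 1
```

A cleverer pattern can handle this particular case, but each extra level of nesting needs another special case, whereas the DOM already has the tree.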
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – geoffspear 2011-05-22 21:11:11
*(related)* [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon 2011-05-22 21:16:10
There, did it with DOM. – 2011-05-22 23:36:55