2016-03-08 76 views
0

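How can I parse HTML5 tags? When I parse my document, the parser complains about the <section> tag, and I don't want it to report an error. How do I parse HTML5 tags in Perl?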

The error says that the "</section>" tag is missing.

My input is:

<?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" xml:lang="en" lang="en"> 
<head> 
<link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/> 
<title>Electric Potential and Electric Potential Energy</title> 
<meta charset="UTF-8"/> 
<meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/> 
<meta name="generator" content="PXE Tools version 1.39.69"/> 
</head> 
<body> 
<section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header> 
<section class="frontmatter"> 
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header> 
<ol> 
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li> 
</ol> 
</section> 
</section> 
</body> 
</html> 

My code is:

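use warnings;
use strict;
use HTML::Tidy;

my $file_name = "d:/perl/test.xhtml";

# Use the low-precedence "or" so that die fires when open() fails;
# with "||" the test applied to the file name string instead.
open my $xhtml_fh, '<:encoding(UTF-8)', $file_name
    or die "Cannot open $file_name: $!";
# Slurp the whole file; local restores $/ automatically afterwards.
my $contents = do { local $/; <$xhtml_fh> };
close $xhtml_fh;

my $tidy = HTML::Tidy->new();
# Suppress any message whose text matches one of these patterns.
$tidy->ignore(
    text => [ qr/DOCTYPE/, qr/html/, qr/meta/, qr/header/ ],
);
$tidy->parse("foo.html", $contents);
for my $message ($tidy->messages) {
    print $message->as_string, "\n";
}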

The error log is:

foo.html (10:1) Error: <section> is not recognized! 
foo.html (10:1) Warning: discarding unexpected <section> 
foo.html (11:1) Error: <section> is not recognized! 
foo.html (11:1) Warning: discarding unexpected <section> 
foo.html (12:1) Error: <section> is not recognized! 
foo.html (12:1) Warning: discarding unexpected <section> 
foo.html (16:1) Warning: discarding unexpected </section> 
foo.html (17:1) Warning: discarding unexpected </section> 

How can I resolve this?

+0

What do you want to do with it after parsing? – simbabque

+5

[Crossposted](http://www.perlmonks.org/?node_id=1157066). – choroba

+2

For starters, HTML::Tidy is for parsing HTML, not XHTML ("the XML serialization of HTML"). – ikegami

Answers

0

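According to its documentation, the HTML::Valid module is based on www.html-tidy.org and supports HTML5. It looks like it will give you the line and column numbers you mentioned in your post on PerlMonks.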
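
To illustrate the suggestion, here is a minimal sketch of how the check might look with HTML::Valid. It assumes the module is installed from CPAN and that its `new` and `run` methods behave as shown in the module's synopsis (`run` returning the tidied output and the diagnostics as two values). The embedded document is a shortened stand-in for the input in the question, not the original file, so please verify the details against the module's documentation before relying on it.

```perl
use strict;
use warnings;
use HTML::Valid;

# A shortened stand-in for the question's input, using the HTML5
# <section> and <header> elements that HTML::Tidy rejected.
my $contents = <<'EOF';
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Electric Potential and Electric Potential Energy</title>
</head>
<body>
<section class="chapter"><header><h1 class="title">Big Ideas</h1></header>
<ol>
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
</ol>
</section>
</body>
</html>
EOF

my $htv = HTML::Valid->new();

# Assumption: run() returns the cleaned-up document and the diagnostics.
my ($output, $errors) = $htv->run($contents);

# Print whatever diagnostics were reported (possibly none).
print "Messages:\n", ($errors // ''), "\n";
```

To check the real file instead, read it into `$contents` the same way the question's code does and pass that string to `run`.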