2010-06-09

Why does HTML::TreeBuilder show mojibake / weird characters in its output?

I have a problem with HTML::TreeBuilder: its output contains mojibake / weird characters.

use strict; 
use WWW::Curl::Easy; 
use HTML::TreeBuilder; 
my $cookie_file ='/tmp/pcook'; 
my $curl = WWW::Curl::Easy->new; 
my $response_body; 
my $charset = 'utf-8'; 
$DocOffline::charset = undef; 
$curl->setopt (CURLOPT_URL, 'http://www.breitbart.com/article.php?id=D9G7CR5O0&show_article=1'); 
$curl->setopt (CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.9 (KHTML, like Gecko) Chrome/6.0.400.0 Safari/533.9'); 
$curl->setopt (CURLOPT_HEADER, 0); 
$curl->setopt (CURLOPT_FOLLOWLOCATION, 1); 
$curl->setopt (CURLOPT_AUTOREFERER, 1); 
$curl->setopt (CURLOPT_SSL_VERIFYPEER, 0); 
$curl->setopt (CURLOPT_COOKIEFILE, $cookie_file); 
$curl->setopt (CURLOPT_COOKIEJAR, $cookie_file); 
$curl->setopt (CURLOPT_HEADERFUNCTION, \&headerCallback); 
open (my $fileb, ">", \$response_body); 
$curl->setopt(CURLOPT_WRITEDATA,$fileb); 
my $retcode = $curl->perform; 
if ($retcode == 0) { 
    my $dom_tree = HTML::TreeBuilder->new(); 
    $dom_tree->ignore_elements(qw(script style)); 
    $dom_tree->utf8_mode(1); 
    $dom_tree->parse($response_body); 
    $dom_tree->eof(); 
    print $dom_tree->as_HTML('<>&', ' ', {}); 
} 
sub headerCallback { 
    my ($data, $pointer) = @_; 
    # Only read $1 after a successful match; otherwise it may hold a stale capture. 
    if ($data =~ m/Content-Type:\s*.*;\s*charset=(.*)/i) { 
        $charset = $1; 
        $charset =~ s/[^a-zA-Z0-9_\-]*//g; 
    } 
    return length($data); 
} 

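You are printing the DOM tree, but your terminal probably does not support UTF-8. Try writing it to a file and then reading it with the browser that displays the page correctly in the first place. – MvanGeest 2010-06-09 14:18:16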


I tried printing it as CGI to the browser, and the result is the same. – Vjy 2010-06-09 16:20:04

Answer


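You did not get an answer for a whole day because your code is a mess in both form and content, and you did not even reduce your whole program to a simplified test case. MvanGeest also made a misdiagnosis in the comment attached to the question.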

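The problem is that whoever wrote Breitbart's CMS is incompetent: they insert the NCR &#151; (which is a non-printable character, arguably even an invalid one) when they should simply have inserted the character U+2014 EM DASH; after all, the document encoding is declared as UTF-8. (One can clearly see that the encoding they had in mind was Windows-1252, where code point 151 (decimal) is assigned to the em dash.)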
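
To make the point about code point 151 concrete, here is a minimal sketch using only the core Encode module. The variable names are mine for illustration and do not come from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 151 decimal is byte 0x97, the value used in the &#151; reference.
my $byte = chr(151);

# Read as Windows-1252, that byte is U+2014 EM DASH.
my $char = decode('Windows-1252', $byte);
printf "U+%04X\n", ord($char);   # prints U+2014

# Read as a Unicode code point, 151 is U+0097, a C1 control character.
printf "U+%04X\n", 151;          # prints U+0097
```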

You can work around their shortcomings with an explicit decode/encode step.

use Encode qw(encode decode); 
⋮ 
my $string_representation = $dom_tree->as_HTML('<>&', ' ', {}); 
my $octets = encode('UTF-8', decode('Windows-1252', $string_representation)); 
⋮ 
# send the correct Content-Type header in your CGI program before printing the HTTP body 
print $octets;
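
A possible variation on the decode/encode step, in case the page also contains genuine characters outside Latin-1 (which a whole-string Windows-1252 decode would not accept): remap only the C1 range, U+0080 to U+009F, to its Windows-1252 meaning. This is a sketch and not part of the original answer; `fix_c1` is a hypothetical helper, and it assumes its input is already a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: reinterpret stray C1 code points as Windows-1252 bytes.
sub fix_c1 {
    my ($text) = @_;
    $text =~ s{([\x{80}-\x{9F}])}{
        my $c     = $1;                                   # copy before calling out
        my $fixed = decode('Windows-1252', chr(ord($c)));
        # Keep the original character if Windows-1252 has no mapping for it.
        $fixed eq "\x{FFFD}" ? $c : $fixed;
    }ge;
    return $text;
}

# Sample text with U+0097 where an em dash was intended.
my $text   = "Obama \x{97} speech";
my $octets = encode('UTF-8', fix_c1($text));
print $octets, "\n";
```

The trade-off is that real C1 control characters, if a page ever contained them on purpose, would be rewritten too; in HTML that is rarely a concern.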