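Which Perl encoding for which HTML charset?

I am writing a Perl script that fetches a variety of HTML documents from many different websites and tries to extract data from them. I am having trouble decoding these documents.

I know how to read the charset from a meta tag, if there is one, and how to read it from the HTTP header, if the header provides that information.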

The result can be, for example:

UTF-8
ISO-8859-1
SHIFT_JIS
Windows-1252

and many more.

With this knowledge, I want to decode the document in my Perl script:

#!/usr/bin/perl -w 

use strict; 

use LWP::UserAgent; 
use Encode; 
use Encode::JP; 

# Maybe also use other extensions for Encode 

my $ua = LWP::UserAgent->new; 
my $response = $ua->get($url); # $url is the document's URL

if ($response->is_success) { 

    my $charset = getcharset($response); 
    # getcharset is a self-written subroutine that reads the charset 
    # from a meta tag or from the HTTP header (not shown in this example) 

    # Now I know the document's charset and want to find its encoding:

    my $encoding = 'utf-8'; # default 

    if ($charset eq 'utf-8') { 
     $encoding = 'utf-8'; # Here $encoding and $charset are equal 

    } 
    elsif ($charset eq 'Shift_JIS') { 
     $encoding = 'shiftjis'; #here $encoding and $charset are not equal 
    } 
    elsif ($charset eq 'windows-1252') { 
     # Here I have no idea what $encoding should be, since there is no 
     # encoding in the documentation that contains the string "windows" 

    } 
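    elsif ($charset eq 'any other character set') {
     # $encoding = ???  (unknown: this mapping is what I am asking about)
    }

    # Decode the raw bytes of the response fetched above
    my $content = decode($encoding, $response->content);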

    # Extract data from $content 
}

But I cannot find the correct encoding for some of the charsets that exist in the wild.

2015-10-14 Hubert Schölnast

You should prefer `use warnings` over `-w` on the shebang line. – Borodin

For HTML documents, all you need is

my $content = $response->decoded_content();

It will use both the value of the charset attribute in the HTTP header and the META element, as needed.
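As a rough, offline illustration of that call, the sketch below builds an HTTP::Response by hand instead of fetching one. The header value and the body bytes are made-up test data, not output from any real site, and the example assumes libwww-perl (which provides HTTP::Response) is installed.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response: the header declares windows-1252, and the body
# contains the byte 0x80, which is the euro sign in that charset.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1252' ],
    "<html><body>Price: \x80 5</body></html>",
);

# decoded_content() takes the charset from the Content-Type header here
# and returns a string of characters rather than raw bytes.
my $content = $response->decoded_content();
die "could not decode response\n" unless defined $content;

# Encode on output so the euro sign prints correctly on a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';
print $content, "\n";
```

With a real request, the same call would be made on the object returned by `$ua->get(...)`, so no hand-written charset-to-encoding table is needed.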

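"But I cannot find the correct encoding for some of the charsets that exist in the wild."

Encode does not support every encoding that has ever existed, but I am surprised that you encountered an HTML page that it could not decode. It may just be a matter of creating an alias, but you have not provided any details that would help us.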
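Creating an alias could look roughly like the sketch below, which uses `define_alias` from Encode::Alias. The label `x-example-charset` is invented for illustration, and mapping it to ISO-8859-1 is purely an assumption; for a real page you would have to work out which supported encoding the unknown label actually corresponds to.

```perl
use strict;
use warnings;
use Encode;
use Encode::Alias;

# 'x-example-charset' is a made-up label standing in for whatever
# unrecognised charset name a page might declare.
define_alias('x-example-charset' => 'iso-8859-1');

# Once the alias exists, the label can be used like any built-in name.
my $bytes = "caf\xE9";
my $text  = decode('x-example-charset', $bytes);

printf "%d characters, last code point U+%04X\n",
    length($text), ord(substr($text, -1));
```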

2015-10-14 14:46:11 ikegami

See Encode::Supported. Basically, most encodings should Just Work™.

binmode STDIN, ':encoding(Shift_JIS)'; 
binmode STDIN, ':encoding(windows-1252)';

Both of these work without errors.
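One way to check this for a given charset name before decoding is `find_encoding` from Encode, which returns undef when the name is not recognised. The sketch below is only an illustration; the list of labels is sample data, and `no-such-charset` is deliberately bogus to show the failure case.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Charset labels as they might appear in a header or meta tag.
my @labels = ('UTF-8', 'ISO-8859-1', 'Shift_JIS', 'windows-1252', 'no-such-charset');

my %resolved;
for my $label (@labels) {
    # find_encoding() returns an encoding object, or undef if Encode
    # has no encoding or alias matching the label.
    my $enc = find_encoding($label);
    $resolved{$label} = $enc ? $enc->name : undef;
    printf "%-16s => %s\n", $label, $resolved{$label} // '(not supported)';
}
```

The object returned for a recognised label can be used directly, for example `$enc->decode($bytes)`, so the charset name from the document does not have to be translated by hand first.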

2015-10-14 14:08:19 choroba

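What does that have to do with decoding an HTTP response? – ikegami

Does that mean that every charset name I find in any HTML document is a valid encoding name that I can use with the module "Encode"? If so, why is this important fact not mentioned in the module's documentation? –

@HubertSchölnast: No, but most of them are. – choroba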
