2011-04-08 76 views

Perl MIME::Parser and encodings in nested bodies (message/rfc822)

Arghhh, this isn't easy. I'm trying to parse some mails with Perl. Let's take an example:

From: [email protected] 
Content-Type: multipart/mixed; 
     boundary="----_=_NextPart_001_01CBE273.65A0E7AA" 
To: [email protected] 

This is a multi-part message in MIME format. 

------_=_NextPart_001_01CBE273.65A0E7AA 
Content-Type: multipart/alternative; 
     boundary="----_=_NextPart_002_01CBE273.65A0E7AA" 


------_=_NextPart_002_01CBE273.65A0E7AA 
Content-Type: text/plain; 
     charset="UTF-8" 
Content-Transfer-Encoding: base64 

[base64-content] 
------_=_NextPart_002_01CBE273.65A0E7AA 
Content-Type: text/html; 
     charset="UTF-8" 
Content-Transfer-Encoding: base64 

[base64-content] 
------_=_NextPart_002_01CBE273.65A0E7AA-- 
------_=_NextPart_001_01CBE273.65A0E7AA 
Content-Type: message/rfc822 
Content-Transfer-Encoding: 7bit 

X-MimeOLE: Produced By Microsoft Exchange V6.5 
Content-class: urn:content-classes:message 
MIME-Version: 1.0 
Content-Type: multipart/mixed; 
     boundary="----_=_NextPart_003_01CBE272.13692C80" 
From: [email protected] 
To: [email protected] 

This is a multi-part message in MIME format. 

------_=_NextPart_003_01CBE272.13692C80 
Content-Type: multipart/alternative; 
     boundary="----_=_NextPart_004_01CBE272.13692C80" 


------_=_NextPart_004_01CBE272.13692C80 
Content-Type: text/plain; 
     charset="iso-8859-1" 
Content-Transfer-Encoding: quoted-printable 

=20 

Viele Gr=FC=DFe 

------_=_NextPart_004_01CBE272.13692C80 
Content-Type: text/html; 
     charset="iso-8859-1" 
Content-Transfer-Encoding: quoted-printable 

<html>...</html> 
------_=_NextPart_004_01CBE272.13692C80-- 
------_=_NextPart_003_01CBE272.13692C80 
Content-Type: application/x-zip-compressed; 
     name="abc.zip" 
Content-Transfer-Encoding: base64 
Content-Disposition: attachment; 
     filename="abc.zip" 

[base64-content] 

------_=_NextPart_003_01CBE272.13692C80-- 
------_=_NextPart_001_01CBE273.65A0E7AA-- 

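This mail was sent from Outlook with another mail attached. As you can see, it is a fairly complex mail with many different content types (text/plain, text/html, message/rfc822, application/xyz)... and the rfc822 part is the problem. I wrote a script in Perl 5.8 (Debian Squeeze) that parses this message with MIME::Parser.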

use MIME::Parser; 
my $parser = MIME::Parser->new; 
$parser->output_to_core(1); 
my $top_entity = $parser->parse(\*STDIN); 
my $plain_body = ""; 
my $html_body = ""; 
my $content_type; 
foreach my $part ($top_entity->parts_DFS) {
    $content_type = $part->effective_type;
    my $body = $part->bodyhandle;    # undef for multipart containers
    if ($body) {
        if ($content_type eq 'text/plain') {
            $plain_body .= "\n" if $plain_body ne '';
            # as_string returns the raw bytes in the part's own charset
            $plain_body .= $body->as_string;
        } elsif ($content_type eq 'text/html') {
            $html_body .= "\n" if $html_body ne '';
            $html_body .= $body->as_string;
        }
    }
}
# parsing of attachment comes later 
print $plain_body; 

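The first message part (the base64 content) contains German umlauts, and they are displayed correctly on STDOUT. The nested rfc822 message is parsed automatically by MIME::Parser and merged with the top-level body into one entity. As you can see, the nested rfc822 message also contains German umlauts, encoded as quoted-printable, but these are not displayed correctly on STDOUT. If I convert the text to UTF-8 before printing, the quoted-printable umlauts are displayed correctly, but the base64-encoded ones are not. I have been trying for hours to extract the rfc822 part separately and re-encode it, but nothing has helped. Can anyone help?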

Regards

Answer

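Assuming your console displays UTF-8, this makes sense. It correctly displays the part that is already UTF-8 once decoded, but of course the latin1 characters are not displayed correctly.
Later, you convert everything to UTF-8, but that makes no sense for data that is already UTF-8. So at that point only the former latin1 umlauts are displayed correctly.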
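
To make the two failure modes concrete, here is a minimal sketch using only the core Encode module. It assumes the text "Grüße" arrives once as UTF-8 bytes and once as iso-8859-1 bytes, as in the sample mail. The variable names and the `hex_dump` helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same text, "Grüße", as raw bytes in the two charsets of the sample mail.
my $utf8_bytes   = "Gr\xC3\xBC\xC3\x9Fe";   # like the base64 part (charset UTF-8)
my $latin1_bytes = "Gr\xFC\xDFe";           # like the quoted-printable part (iso-8859-1)

# A blanket "convert to UTF-8" treats every input byte as one latin1 character.
my $fixed  = encode('UTF-8', $latin1_bytes);   # becomes valid UTF-8
my $broken = encode('UTF-8', $utf8_bytes);     # already UTF-8, now double-encoded

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

print hex_dump($fixed),  "\n";   # 47 72 C3 BC C3 9F 65
print hex_dump($broken), "\n";   # 47 72 C3 83 C2 BC C3 83 C2 9F 65
```

The first line of output is the same byte sequence as `$utf8_bytes`, which is why the latin1 umlauts look right after the conversion while the UTF-8 ones turn into garbage.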

You can't get this right without looking at the "charset" parameter of each part's Content-Type and acting accordingly.
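
One way this might look with the loop from the question is sketched below. It is not a definitive implementation: `body_to_chars` and `collect_text` are hypothetical helper names, and the fallback to iso-8859-1 for a missing or unknown charset is an assumption. The idea is to decode each part to Perl characters with its own charset and to encode to UTF-8 only once, on output.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: turn the body bytes of one text part into Perl
# characters, using the charset declared in that part's Content-Type.
sub body_to_chars {
    my ($bytes, $charset) = @_;
    # Assumption: fall back to iso-8859-1 when the charset is missing or
    # unknown to Encode, because that decoding cannot fail.
    $charset = 'iso-8859-1'
        unless defined $charset && find_encoding($charset);
    return decode($charset, $bytes);
}

# Hypothetical helper: collect all parts of one type (e.g. 'text/plain')
# from a parsed MIME::Entity, decoding each part with its own charset.
sub collect_text {
    my ($top_entity, $wanted_type) = @_;
    my $text = '';
    foreach my $part ($top_entity->parts_DFS) {
        next unless $part->effective_type eq $wanted_type;
        my $body = $part->bodyhandle or next;   # containers have no body
        my $charset = $part->head->mime_attr('content-type.charset');
        $text .= "\n" if $text ne '';
        $text .= body_to_chars($body->as_string, $charset);
    }
    return $text;
}

# Usage with the parser set up as in the question (not run here):
#   my $parser = MIME::Parser->new;
#   $parser->output_to_core(1);
#   my $top_entity = $parser->parse(\*STDIN);
#   binmode STDOUT, ':encoding(UTF-8)';   # encode once, on output
#   print collect_text($top_entity, 'text/plain');
```

With this approach the UTF-8 part and the iso-8859-1 part of the sample mail end up as the same character string internally, so a single output encoding works for both.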

OK, thanks. I understand what the problem is. I'm now using a PHP script, and I'm quite happy with it. – rabudde 2011-05-16 04:41:34