2012-07-15 113 views

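Python: msg.get_payload() discards required data, solution wanted

Hi, I have gone through various posts here, but none of them answered my question. I have two questions.

1. I wrote a script that fetches email using poplib. Everything works fine until I try to parse the body of the email: the parser drops the tags along with the data inside them. I have given up and am asking for help here, hoping you can point me in the right direction: what am I doing wrong, or what should I do to make it work?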

Here is the source of the parser script:

import sys 
import socket 
import poplib 
import email 
import csv 
import re 
try: 
    host = "mail.someserver.com" 
    mail = poplib.POP3(host) 
    print mail.getwelcome() 
    print mail.user("[email protected]") 
    print mail.pass_("qaiaJWkvZT") 
    print mail.stat() 
    print mail.list() 
    print "" 

    emailWriter = csv.writer(open('emailMessages.csv', 'wb'), delimiter=',', quotechar='\'', quoting=csv.QUOTE_MINIMAL) 
    emailWriter.writerow(['Messages']) 
    if mail.stat()[1] > 0: 
        print "You have new mail." 
    else: 
        print "No new mail." 

    print "" 

    numMessages = len(mail.list()[1]) 
    for i in range(numMessages): 
        for j in mail.retr(i+1)[1]: 
            #print j 
            msg = email.message_from_string(j) # new statement 
            print msg.get_payload(decode=True) 
            #emailWriter.writerow([msg.get_payload(decode=True)]) # new statement 

    mail.quit() 
    raw_input("Press Enter to continue.")  # input() would eval() the typed text in Python 2
except socket.error as e: 
    print "Something went wrong! :(\nREASON:\n{0}:{1}".format(e.errno, e.strerror) 
    raise 
except: 
    print "Something went wrong!", sys.exc_info()[0] 
    raise 

Here is the output the script above generates:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or 
g/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<style type="text/css"> 
BODY { 







} 
TD { 



} 
TH { 


} 
H1 { 

} 
TABLE,IMG,A { 

} 
</style> 
</head> 
<body> 


<p><strong>PO Number:</strong> 35164</p> 

<p><strong>Ship To:</strong><br /> 
Joe Pasloski<br /> 
16 Redwood Drive<br />Yorkton, SK S3N2X7<br /> 
204-473-2218</p> 


<table cellspacing="0" cellpadding="5" border="1" width="710" align="left"> 
<tr> 



</tr> 
<tr> 



</tr> 
</table> 
</body> 
</html> 

But if I change the script to print the j variable directly inside the loop, it gives me this:

Return-Path: <[email protected]> 
Delivered-To: [email protected] 
Received: (qmail 7636 invoked by uid 48); 14 Jul 2012 23:29:11 -0000 
Date: 14 Jul 2012 23:29:11 -0000 
Message-ID: <[email protected]> 
To: [email protected] 
Subject: Drop Ship Order - Joe Pasloski 
From: Someserver.com <[email protected]> 
X-Mailer: PHP/5.2.17 
MIME-Version: 1.0 
Content-Type: multipart/alternative; 
     boundary="2631183869_50020" 
Reply-to: SomeServer.com <[email protected]> 
X-Antivirus: avast! (VPS 120714-2, 07/15/2012), Inbound message 
X-Antivirus-Status: Clean 

--2631183869_50020 
Content-Type: text/plain; 
     charset="iso-8859-1" 
Content-Transfer-Encoding: 8bit 



--2631183869_50020 
Content-Type: text/html; 
     charset="iso-8859-1" 
Content-Transfer-Encoding: 8bit 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or 
g/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<style type="text/css"> 
BODY { 
     MARGIN-TOP: 10px; 
     MARGIN-BOTTOM: 10px; 
     MARGIN-LEFT: 10px; 
     MARGIN-RIGHT: 10px; 
     FONT-SIZE: 12px; 
     FONT-FAMILY: arial,helvetica,sans-serif 
     PADDING: 0px; 
} 
TD { 
     FONT-SIZE: 12px; 
     FONT-FAMILY: arial,helvetica,sans-serif 
     COLOR: #000000; 
} 
TH { 
     FONT-SIZE: 13px; 
     FONT-FAMILY: arial,helvetica,sans-serif 
} 
H1 { 
    FONT-SIZE: 20px 
} 
TABLE,IMG,A { 
     BORDER: 0px; 
} 
</style> 
</head> 
<body> 


<p><strong>PO Number:</strong> 35164</p> 

<p><strong>Ship To:</strong><br /> 
Joe Pasloski<br /> 
16 Redwood Drive<br />Yorkton, SK S3N2X7<br /> 
204-473-2218</p> 

<p><strong>Items:</strong> 
<table cellspacing="0" cellpadding="5" border="1" width="710" align="left"> 
<tr> 
     <th align="left" width="100">SKU</th> 
     <th align="left" width="550">Product</th> 
     <th align="left" width="60">Qty</th> 
</tr> 
<tr> 
     <td>JJ-Hamper-Firetruck</td> 
     <td>Frankie's Fire Truck Laundry Hamper</td> 
     <td>1</td> 
</tr> 
</table> 
</body> 
</html> 

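Now, if I have to work with the raw message, how can I efficiently strip the unwanted HTML tags without losing any data, the way part of the message body currently gets dropped automatically? Or, if this can be done with the get_payload() method, what do I need to do to make it work?

Please help!

2. Also, is there a way to use a regular expression to get all the SKU information contained in the table? If you could provide me with one, that would be great. Thanks a ton.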
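
One possible way to approach this second question is sketched below. It assumes Python 3 and that the table rows look exactly like the ones in the raw message above: a `<tr>` holding three plain `<td>` cells. The names `ROW_RE` and `extract_items` are made up for illustration. Regular expressions are fragile on HTML in general, so treat this as a sketch for this specific markup rather than a general solution.

```python
import re

# Matches one data row: <tr> followed by exactly three plain <td> cells.
# Header rows use <th>, so they are skipped.
ROW_RE = re.compile(
    r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)


def extract_items(html):
    """Return a list of (sku, product, qty) tuples found in the HTML."""
    return [tuple(cell.strip() for cell in row) for row in ROW_RE.findall(html)]


# Table copied from the raw message shown in the question.
sample_html = """
<table cellspacing="0" cellpadding="5" border="1" width="710" align="left">
<tr>
     <th align="left" width="100">SKU</th>
     <th align="left" width="550">Product</th>
     <th align="left" width="60">Qty</th>
</tr>
<tr>
     <td>JJ-Hamper-Firetruck</td>
     <td>Frankie's Fire Truck Laundry Hamper</td>
     <td>1</td>
</tr>
</table>
"""

for sku, product, qty in extract_items(sample_html):
    print(sku, product, qty)
```

If the cells ever carry attributes or nested tags, the pattern stops matching; an HTML parser such as the standard library's `html.parser` would probably hold up better in that case.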

Answer


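OK, I found the answer myself. The documentation says it all, and the post "Python: How to get HTML body of an email message using poplib?" pointed me in the right direction. As far as I could tell, the messages I was dealing with were not of multipart type, and applying get_payload() lost the HTML data, so I would have had to write some regex routine to strip the HTML tags from the raw message. Instead, I downloaded Aaron Swartz's html2text library, ran it on the raw message, and then called msg.get_payload(). Here is what I did: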

import html2text  # added to my source

numMessages = len(mail.list()[1])
for i in range(numMessages):
    for j in mail.retr(i+1)[1]:
        msg = email.message_from_string(html2text.html2text(j))
        print msg.get_payload(decode=False)

This in turn gave me:

charset="iso-8859-1" 











BODY { 









} 


TD { 





} 


TH { 




} 


H1 { 



} 


TABLE,IMG,A { 



} 










**PO Number:** 35170 




**Ship To:** 


Tami Curtis 


67 E. Spring Creek Pkwy 

Providence, UT 84332 


4357553197 









SKU 


Product 


Qty 






JJ-Panel-Isabella-BK-PRT 


Isabella Black Damask Curtains (2 Panels) 


1 

Now I just need to clean it up with a regular expression to get rid of the unnecessary newlines/whitespace and the CSS markup.
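
That cleanup step could look something like the sketch below. It assumes Python 3, that the leftover CSS always has the `SELECTOR { ... }` shape seen in the output above, and that the text has already been joined into a single string. `tidy_text` is a hypothetical helper name, not part of html2text.

```python
import re


def tidy_text(text):
    """Drop leftover CSS rule blocks and collapse runs of blank lines."""
    # Remove "SELECTOR { ... }" blocks, including ones that span several lines.
    text = re.sub(r"^[^\n{}]*\{[^{}]*\}[ \t]*\n?", "", text, flags=re.MULTILINE)
    # Strip trailing spaces and tabs from every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# A shortened stand-in for the html2text output shown above.
sample = (
    "BODY { \n\n\n} \n\n\nTD { \n\n} \n\n\n\n"
    "**PO Number:** 35170 \n\n\n\n\n**Ship To:** \n\n\nTami Curtis \n"
)
print(tidy_text(sample))
```

The CSS pattern would also swallow any ordinary line that happens to contain a `{ ... }` pair, so it is only reasonable for messages shaped like this one.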

Hope it helps someone :) Cheers!
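
A different route may be worth trying, since the raw headers printed in the question declare `Content-Type: multipart/alternative`: join all the lines returned by `retr()` into one message before parsing, then walk its parts and pick the `text/html` one. The sketch below assumes Python 3, where `retr()` returns a list of `bytes` lines; `html_body` is a hypothetical helper name, and the fallback charset is taken from the headers shown above.

```python
import email


def html_body(raw_lines):
    """Parse the whole message at once and return its text/html part, if any."""
    # Join the lines back into a single message instead of parsing each line.
    msg = email.message_from_bytes(b"\r\n".join(raw_lines))
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # bytes for a leaf part
            charset = part.get_content_charset() or "iso-8859-1"
            return payload.decode(charset, errors="replace")
    return None


# A trimmed-down stand-in for what mail.retr(i + 1)[1] would return.
sample_lines = [
    b"MIME-Version: 1.0",
    b"Content-Type: multipart/alternative;",
    b'     boundary="2631183869_50020"',
    b"",
    b"--2631183869_50020",
    b"Content-Type: text/plain;",
    b'     charset="iso-8859-1"',
    b"Content-Transfer-Encoding: 8bit",
    b"",
    b"",
    b"--2631183869_50020",
    b"Content-Type: text/html;",
    b'     charset="iso-8859-1"',
    b"Content-Transfer-Encoding: 8bit",
    b"",
    b"<p><strong>PO Number:</strong> 35164</p>",
    b"--2631183869_50020--",
]

print(html_body(sample_lines))
```

In the script above this would be called as `html_body(mail.retr(i + 1)[1])`. Under Python 2, which the thread uses, the rough equivalent would be `email.message_from_string("\n".join(lines))`. The returned HTML could then be passed to html2text once per message rather than once per line.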