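Parse an HTML file and add the images found to a zip file

I want to parse an HTML document for all img tags, download every image that the src attributes point to, and then add those files to a zip file. I would prefer to do all of this in memory, since I can guarantee there won't be that many images.

Assume the images variable has already been populated by parsing the HTML. What I need help with is getting the images into the zip file.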

from zipfile import ZipFile 
from StringIO import StringIO 
from urllib2 import urlopen 

s = StringIO() 
zip_file = ZipFile(s, 'w') 
try: 
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally: 
    zip_file.close()


2009-12-22 Jason Christa

Use urllib2 / lxml / XPath / Google – mykhal 2009-12-22 22:22:51

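Seconding Brian Agnew's remarks, it looks like you have almost everything sorted out. You have to use zip_file.writestr() because you are writing data from a byte buffer (i.e. a byte string) rather than from a file located on the filesystem (which is what zip_file.write() expects to receive). – 2009-12-22 23:29:37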
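
To illustrate the difference described in this comment, here is a minimal sketch. It is written for Python 3 (where io.BytesIO takes the place of StringIO), and the member names and payload are made up for the example. writestr() takes the archive member name plus the data itself, while write() takes the path of an existing file on disk.

```python
import io
import os
import tempfile
from zipfile import ZipFile

payload = b"fake image bytes"  # hypothetical stand-in for downloaded data

buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    # writestr(): the data is already in memory as bytes
    zf.writestr("from-memory.jpg", payload)

    # write(): the data has to live in a file on the filesystem
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
        tmp.write(payload)
    try:
        zf.write(tmp.name, arcname="from-disk.jpg")
    finally:
        os.remove(tmp.name)

# Re-open the in-memory archive to check what ended up inside it
with ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
    contents = {name: zf.read(name) for name in names}
```

Since the images here come straight from a network response and never touch the disk, writestr() is the natural fit.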

Don't forget the stylesheets and the images referenced in them... – 2013-08-19 21:37:28

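To answer your other question about how to create the zip archive (others here have discussed parsing out the URLs), I tested your code. You are already remarkably close to a finished product.

Here is how I would augment what you have to create the zip archive (in this example I write the archive out to disk so that I can verify that it was written correctly).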

from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED 
import zlib 
from cStringIO import StringIO 
from urllib2 import urlopen 
from urlparse import urlparse 
from os import path 

images = ['http://sstatic.net/so/img/logo.png', 
      'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png'] 

buf = StringIO() 
# By default, zip archives are not compressed... adding ZIP_DEFLATED
# to achieve that. If you don't want that, or don't have zlib on your
# system, delete the compression kwarg
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED) 

for image in images: 
    internet_image = urlopen(image) 
    fname = path.basename(urlparse(image).path) 
    zip_file.writestr(fname, internet_image.read()) 

zip_file.close() 

output = open('images.zip', 'wb') 
output.write(buf.getvalue()) 
output.close() 
buf.close()
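
The code above targets Python 2 (urllib2, urlparse, cStringIO). For readers on Python 3, the same idea might look roughly like the sketch below. The helper names download and zip_images are hypothetical and not part of the original answer, and the fetch parameter is only there so the download step can be swapped out (for example in a test that should not touch the network).

```python
import io
from os import path
from urllib.parse import urlparse
from urllib.request import urlopen
from zipfile import ZipFile, ZIP_DEFLATED


def download(url):
    """Return the body of the response for url as bytes."""
    with urlopen(url) as response:
        return response.read()


def zip_images(urls, fetch=download):
    """Build a zip archive in memory and return its raw bytes."""
    buf = io.BytesIO()
    with ZipFile(buf, mode="w", compression=ZIP_DEFLATED) as zip_file:
        for url in urls:
            # Name each member after the last path component of its URL
            fname = path.basename(urlparse(url).path)
            zip_file.writestr(fname, fetch(url))
    return buf.getvalue()
```

The returned bytes can be written to disk with open('images.zip', 'wb') or sent on directly. Note that two URLs ending in the same file name would yield two members with the same name, so a real script may want to make the names unique.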


2009-12-22 23:53:29

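I'm not quite sure what you're asking here, since you appear to have most of it sorted.

Have you looked at HTMLParser to actually perform the HTML parsing? I wouldn't try hand-rolling a parser yourself - it's a major task with numerous edge cases. Don't even think about regular expressions for anything but the most trivial cases.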
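
As a rough illustration of the parser-based approach, here is a sketch using the standard library's html.parser module from Python 3 (in Python 2 the module is named HTMLParser). The class name ImgSrcCollector and the function get_img_srcs are made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list
        # of (name, value) pairs. Self-closing <img/> tags also end up
        # here via the default handle_startendtag().
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def get_img_srcs(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.srcs


sample = '<p><img src="a.png"><IMG SRC="b.jpg" alt="x"/><img alt="no src"></p>'
srcs = get_img_srcs(sample)
```

Relative src values would still need to be resolved against the page URL (for instance with urllib.parse.urljoin) before they can be downloaded.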

For each <img/> tag you can use httplib to actually fetch the image. Fetching the images in multiple threads may speed up building the zip file.
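
One way the multi-threaded fetch could be sketched in Python 3 is with a thread pool from concurrent.futures. This sketch uses urllib.request rather than httplib, and the names download and fetch_all as well as the default of 4 workers are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url):
    """Return the body of the response for url as bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=download, max_workers=4):
    """Fetch all URLs concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs fetch in worker threads but yields the results
        # in input order
        return list(pool.map(fetch, urls))
```

Only the downloads run in parallel here. Writing the results into the zip file is best left to a single thread once the data has arrived.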


2009-12-22 22:24:42

+1 for suggesting parsing the HTML! – Mongoose 2009-12-22 22:31:34

Why the downvote? – 2009-12-22 22:50:30

The simplest way I can think of is to use the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup 
from collections import defaultdict 

def getImgSrces(html): 
    srcs = [] 
    soup = BeautifulSoup(html) 

    for tag in soup('img'):
        attrs = defaultdict(str)
        for attr in tag.attrs:
            attrs[attr[0]] = attr[1]
        attrs = dict(attrs)

        if 'src' in attrs:
            srcs.append(attrs['src'])

    return srcs

This should give you a list of URLs, derived from your img tags, to loop through.


2009-12-22 22:31:05 KingRadical

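Why not just: 'for attr in tag.attrs: if attr[0] == 'src': srcs.append(attr[1])' instead? Why bother with your attrs dict at all? – 2009-12-23 00:16:19

I just adapted this from a routine I had written where I wanted a dict of all the attributes, though you could do it that way. I'm not sure there is much to gain performance-wise, though. – KingRadical 2009-12-23 16:55:56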
