How to download files from a web page using Python

I am trying to create a script that scrapes a web page and downloads any image files it finds.

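My first function is a wget-style function that reads the web page and assigns it to a variable. My second function uses a regular expression to search the page's HTML for src= attributes. Here is that function: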

import re

def find_image(text):
    '''Find .gif, .jpg and .bmp files'''
    documents = re.findall(r'\ssrc="([^"]+)"', text)
    count = len(documents)
    print "[+] Total number of files found: %s" % count
    return '\n'.join([str(x) for x in documents])

The output from this looks like:

example.jpg 
image.gif 
http://www.webpage.com/example/file01.bmp

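I am trying to write a third function that downloads these files using urllib.urlretrieve(url, filename), but I am not sure how to go about it, mainly because some of the output is absolute paths while the rest is relative. I am also not sure how to download all of them in one go without having to specify the name and location each time.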

Source

2016-11-24 Billy King

Don't parse HTML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – n1c9

Path-agnostic fetch (handles both absolute and relative paths):

from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os

def fetch_url(url, out_folder="test/"):
    """Downloads all the images at 'url' to out_folder (default: test/)"""
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs(urlopen(url), "html.parser")
    # Create the output folder if it does not exist yet
    if not os.path.isdir(out_folder):
        os.makedirs(out_folder)

    for image in soup.findAll("img"):
        src = image.get("src")
        if not src:
            continue  # skip <img> tags that have no src attribute
        print "Image: %s" % src
        # urljoin resolves a relative src against the page URL and
        # leaves an absolute src unchanged
        image_url = urlparse.urljoin(url, src)
        filename = image_url.split("/")[-1]
        outpath = os.path.join(out_folder, filename)
        urlretrieve(image_url, outpath)

fetch_url('http://www.w3schools.com/html/')
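
For readers on Python 3, where the urlparse module lives in urllib.parse, the relative/absolute handling can be tried in isolation with urljoin. This is only a small sketch: the page URL and the helper name resolve_sources are made up for illustration, and the src values are the ones from the question's sample output.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the HTML was fetched from (an assumption).
PAGE_URL = "http://www.webpage.com/example/index.html"

# The src values shown in the question's sample output.
SOURCES = [
    "example.jpg",
    "image.gif",
    "http://www.webpage.com/example/file01.bmp",
]


def resolve_sources(page_url, sources):
    """Turn every src value into an absolute URL.

    urljoin resolves a relative src against page_url and returns an
    already-absolute src unchanged.
    """
    return [urljoin(page_url, src) for src in sources]


for absolute_url in resolve_sources(PAGE_URL, SOURCES):
    print(absolute_url)
```

Relative names are resolved against the directory of the page, so example.jpg becomes http://www.webpage.com/example/example.jpg, while the .bmp URL passes through untouched.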

Source

2016-11-24 19:11:57

I can't write the complete code for you, and I'm sure that's not what you want anyway, but here are some hints:

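1) Do not parse arbitrary HTML pages with regular expressions; there are quite a few parsers made for that job. I suggest BeautifulSoup. You would filter all img elements and read their src values.

2) With the src values in hand, you can download your files the way you are already doing. As for the relative/absolute issue, use the urlparse module, as per this SO answer. The idea is to join the src of the image with the URL from which you downloaded the HTML. If the src is already absolute, it will stay that way.

3) As for downloading them all, simply iterate over the list of web pages you want to download images from and perform steps 1 and 2 for each image in each page. When you say "at the same time", you probably mean downloading them asynchronously. In that case, I suggest downloading each web page in one thread.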
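
The three steps above could be sketched roughly as follows. This is a Python 3 sketch under stated assumptions, not a finished downloader: it swaps in the standard library's html.parser for BeautifulSoup so that it has no third-party dependency, and the names ImgSrcParser, image_urls, download_page_images, download_all and the "images" output folder are all made up for illustration.

```python
import os
import posixpath
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen, urlretrieve


class ImgSrcParser(HTMLParser):
    """Step 1: collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # ignore <img> tags without a usable src
                self.sources.append(src)


def image_urls(page_url, html):
    """Step 2: make every src absolute by joining it with the page URL."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]


def download_page_images(page_url, out_folder="images"):
    """Download every image referenced by one page; return the saved paths."""
    os.makedirs(out_folder, exist_ok=True)
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    saved = []
    for url in image_urls(page_url, html):
        # Derive the file name from the URL path so it need not be given by hand.
        filename = posixpath.basename(urlsplit(url).path) or "image"
        outpath = os.path.join(out_folder, filename)
        urlretrieve(url, outpath)
        saved.append(outpath)
    return saved


def download_all(page_urls, max_workers=4):
    """Step 3: handle each page in its own worker thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_page_images, page_urls))
```

Calling download_all(["http://www.w3schools.com/html/"]) would fetch that page and save its images under images/. Two images with the same file name would overwrite each other in this sketch, so a real script would probably want to make the names unique.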

Source

2016-11-24 19:16:00 lucasnadalutti
