Downloading files with mechanize

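I have a mechanize browser instance with a page open. I want to download and save every link on that page (they are all PDFs). Does anyone know how to do this?

Thanks

Source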

2011-10-03 Dave

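Probably not the answer you're looking for, but I've used the lxml and requests libraries together for automated anchor fetching:

Relevant lxml examples: http://lxml.de/lxmlhtml.html#examples (substituting requests for urllib)

And the requests library homepage: http://docs.python-requests.org/en/latest/index.html

It's not as compact as mechanize, but it gives you more control.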
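A minimal sketch of the lxml plus requests approach described above, assuming Python 3 with both libraries installed. The function names `find_pdf_links` and `download_all` are made up for illustration and do not come from either library; the approach simply parses the page, resolves each anchor against the page URL, and saves anything whose path ends in `.pdf`.

```python
import os
from urllib.parse import urljoin, urlsplit

import lxml.html
import requests


def find_pdf_links(html, base_url):
    """Return absolute URLs for every <a href> whose path ends in .pdf."""
    doc = lxml.html.fromstring(html)
    links = []
    for href in doc.xpath('//a/@href'):
        # urljoin resolves relative hrefs and leaves absolute ones untouched.
        url = urljoin(base_url, href)
        if urlsplit(url).path.lower().endswith('.pdf'):
            links.append(url)
    return links


def download_all(page_url, dest_dir='.'):
    """Fetch page_url and save every linked PDF into dest_dir (hypothetical helper)."""
    session = requests.Session()  # keeps cookies between requests
    page = session.get(page_url, timeout=30)
    page.raise_for_status()
    for url in find_pdf_links(page.text, page_url):
        filename = os.path.basename(urlsplit(url).path)
        response = session.get(url, timeout=30)
        response.raise_for_status()
        with open(os.path.join(dest_dir, filename), 'wb') as fh:
            fh.write(response.content)
        print('saved', filename)
```

Calling `download_all('https://example.com/docs/')` (a placeholder URL) would save each linked PDF into the current directory. Keeping link extraction in its own function makes it possible to check the parsing on a saved HTML string before touching the network.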

Source

2011-10-03 21:34:43 David

Hi David, I'll give it a try – Dave

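# Python 2 code
import os, re, urllib, urllib2, cookielib, urlparse
# http://www.crummy.com/software/BeautifulSoup/ - required (BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup

HOST = 'https://www.adobe.com/'

# Keep cookies across requests in case the site needs a session.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

page_url = HOST + 'pdf'
req = opener.open(page_url)
response = req.read()

soup = BeautifulSoup(response)
# Select every anchor whose href contains ".pdf".
pdfs = soup.findAll(name='a', attrs={'href': re.compile(r'\.pdf')})
for pdf in pdfs:
    # urljoin resolves relative hrefs and leaves absolute URLs unchanged,
    # which avoids building URLs with a doubled slash.
    url = urlparse.urljoin(page_url, pdf['href'])
    # Save under the file's own name; without a filename argument
    # urlretrieve only writes to a temporary file.
    filename = os.path.basename(urlparse.urlsplit(url).path)
    try:
        # http://docs.python.org/library/urllib.html#urllib.urlretrieve
        urllib.urlretrieve(url, filename)
    except Exception, e:
        print 'cannot obtain url %s' % (url,)
        print 'from href %s' % (pdf['href'],)
        print e
    else:
        print 'downloaded file'
        print url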
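The script above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same idea could be sketched with only the standard library, roughly as follows. The names `PdfLinkParser`, `pdf_links`, and `download_pdfs` are hypothetical, and this is an approximation of the script's behaviour rather than a drop-in port.

```python
import http.cookiejar
import os
import shutil
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> values that contain '.pdf'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and '.pdf' in href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def pdf_links(html, base_url):
    """Return the PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_pdfs(page_url):
    """Download every PDF linked from page_url into the current directory."""
    # Cookie-aware opener, mirroring the cookielib setup in the original script.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))
    with opener.open(page_url) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    for url in pdf_links(html, page_url):
        filename = os.path.basename(urlsplit(url).path)
        try:
            with opener.open(url) as resp, open(filename, 'wb') as fh:
                shutil.copyfileobj(resp, fh)
        except OSError as e:  # URLError and HTTPError are OSError subclasses
            print('cannot obtain url %s: %s' % (url, e))
        else:
            print('downloaded', url)
```

Unlike `urlretrieve`, this version downloads the files through the same cookie-aware opener that fetched the page, which may matter if the PDFs sit behind a session.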

Source

2011-10-03 21:42:52 cetver

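As a big fan of BeautifulSoup, a word of caution: the library is no longer actively developed. http://www.crummy.com/software/BeautifulSoup/3.1-problems.html Most people familiar with BS have advised me to move to lxml – David

Thanks, I didn't know that. – cetver

I believe BeautifulSoup is still being actively developed –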
