2011-10-03 73 views
4

我有一個打開頁面的瀏覽器實例。我想下載並保存所有鏈接(它們是PDF)。 有人知道該怎麼做嗎?用機械化下載文件

THX

回答

3
import urllib, urllib2,cookielib, re 
#http://www.crummy.com/software/BeautifulSoup/ - required 
from BeautifulSoup import BeautifulSoup 

HOST = 'https://www.adobe.com/' 

cj = cookielib.CookieJar() 
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 

req = opener.open(HOST + 'pdf') 
responce = req.read() 

soup = BeautifulSoup(responce) 
pdfs = soup.findAll(name = 'a', attrs = { 'href': re.compile('\.pdf') }) 
for pdf in pdfs: 
    if 'https://' not in pdf['href']: 
     url = HOST + pdf['href'] 
    else: 
     url = pdf['href'] 
    try: 
     #http://docs.python.org/library/urllib.html#urllib.urlretrieve 
     urllib.urlretrieve(url) 
    except Exception, e: 
     print 'cannot obtain url %s' % (url,) 
     print 'from href %s' % (pdf['href'],) 
     print e 
    else: 
     print 'downloaded file' 
     print url 
+1

作爲BeautifulSoup的忠實粉絲,謹慎的一句話是圖書館不再被積極開發。 http://www.crummy.com/software/BeautifulSoup/3.1-problems.html大多數熟悉BS的人建議我過渡到Lxml – David

+0

Thanx,我不知道。 – cetver

+2

我相信BeautifulSoup仍在積極開發中 –