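Scraping and downloading files from informer.com with a Python script

For research purposes, I need to build a set of benign programs. First, I need to get these programs from http://downloads.informer.com. To do so, I have written a Python script that iterates over each download page and extracts the download links into a list. The script then uses these links to download the programs (exe, msi, or zip files). Unfortunately, at this step the script fails with the error AttributeError: 'Request' object has no attribute 'decode'.
Below is a script that works on a single page and retrieves a single program (for simplicity):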
import wget
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://sweet-home-3d.informer.com/download'

import urllib.request
req = urllib.request.Request(
    my_url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

uClient = uReq(req)
page_html = uClient.read()
page_soup = soup(page_html, 'lxml')

cont01 = page_soup.findAll('a', {'class': 'download_button'})
conts = cont01[1]
ref = conts['href']

addr = urllib.request.Request(
    ref,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0'
    }
)
wget.download(addr)
The error I get is the following:
AttributeError                            Traceback (most recent call last)
<ipython-input-1-93c4caaa1777> in <module>()
     31     }
     32 )
---> 33 wget.download(addr)

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in download(url, out, bar)
    503
    504     # get filename for temp file in current directory
--> 505     prefix = detect_filename(url, out)
    506     (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
    507     os.close(fd)

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in detect_filename(url, out, headers, default)
    482         names["out"] = out or ''
    483     if url:
--> 484         names["url"] = filename_from_url(url) or ''
    485     if headers:
    486         names["headers"] = filename_from_headers(headers) or ''

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in filename_from_url(url)
    228     """:return: detected filename as unicode or None"""
    229     # [ ] test urlparse behavior with unicode url
--> 230     fname = os.path.basename(urlparse.urlparse(url).path)
    231     if len(fname.strip(" \n\t.")) == 0:
    232         return None

C:\Users\bander\Anaconda3\lib\urllib\parse.py in urlparse(url, scheme, allow_fragments)
    292     Note that we don't break the components up in smaller bits
    293     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 294     url, scheme, _coerce_result = _coerce_args(url, scheme)
    295     splitresult = urlsplit(url, scheme, allow_fragments)
    296     scheme, netloc, url, query, fragment = splitresult

C:\Users\bander\Anaconda3\lib\urllib\parse.py in _coerce_args(*args)
    112     if str_input:
    113         return args + (_noop,)
--> 114     return _decode_args(args) + (_encode_result,)
    115
    116 # Result objects are more helpful than simple tuples

C:\Users\bander\Anaconda3\lib\urllib\parse.py in _decode_args(args, encoding, errors)
     96 def _decode_args(args, encoding=_implicit_encoding,
     97                  errors=_implicit_errors):
---> 98     return tuple(x.decode(encoding, errors) if x else '' for x in args)
     99
    100 def _coerce_args(*args):

C:\Users\bander\Anaconda3\lib\urllib\parse.py in <genexpr>(.0)
     96 def _decode_args(args, encoding=_implicit_encoding,
     97                  errors=_implicit_errors):
---> 98     return tuple(x.decode(encoding, errors) if x else '' for x in args)
     99
    100 def _coerce_args(*args):

AttributeError: 'Request' object has no attribute 'decode'
I would be grateful if someone could help me solve this problem. Thanks in advance.
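Judging from the traceback, the failure seems to happen before any network access: wget.download passes its first argument to urlparse, which expects a str or bytes URL, while addr is a urllib.request.Request object. The following is only a minimal sketch that reproduces this outside of wget, with no download involved. The helper name parse_target is hypothetical, and the URL is simply the one from the script above.

```python
from urllib.parse import urlparse
from urllib.request import Request


def parse_target(target):
    """Return the URL path, or a short error description if parsing fails."""
    try:
        return urlparse(target).path
    except (AttributeError, TypeError) as exc:
        # urlparse only understands str/bytes; other objects end up here.
        return "{}: {}".format(type(exc).__name__, exc)


url = "http://sweet-home-3d.informer.com/download"

as_string = parse_target(url)            # a plain string parses fine
as_request = parse_target(Request(url))  # a Request object does not

print(as_string)
print(as_request)
```

If that reading is right, passing the href string itself (ref in the script) rather than a Request wrapped around it would get past this particular error, although wget.download would then not send the custom User-Agent header.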
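Your problem may be that "addr" is not the actual link to the file but a redirect. Try clicking the link with Chrome Inspector open (select the "Network" tab) and see where it fetches the actual content from. – jlaur
You are right, the actual link is different. For example, the inspector shows the following link: http://download.informer.com/win-1193020099-a188ca2c-5607e42f/flow_v111_full.zip However, when I put it into addr, the same error appears again. Also, how can I scrape this actual link? – Bander
You could try tracing the final destination. Take a look at this: https://stackoverflow.com/a/20475712/8240959 – jlaur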
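One way the "trace the final destination" idea from the comments could look, using only the standard library: urlopen follows HTTP redirects on its own, and geturl() on the response reports the URL that was finally fetched. This is a rough sketch, not necessarily the approach in the linked answer. It assumes the site redirects with ordinary HTTP status codes (a JavaScript or meta-refresh redirect would not be followed), and the names USER_AGENT, name_from_url, and download are made up for illustration.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Assumption: a browser-like User-Agent is enough for the site to respond.
USER_AGENT = "Mozilla/5.0"


def name_from_url(url, default="download.bin"):
    """Use the last path segment of the URL as the local file name."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download(url, out_dir="."):
    """Follow HTTP redirects from url and save the final response to disk."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        final_url = resp.geturl()  # URL after any HTTP redirects
        target = os.path.join(out_dir, name_from_url(final_url))
        with open(target, "wb") as fh:
            shutil.copyfileobj(resp, fh)  # stream the body in chunks
    return final_url, target
```

With the script above, this would be called as final_url, path = download(ref), where ref is the href string taken from the download button. Here the Request object goes to urlopen, which accepts it, instead of to wget.download.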