Scraping and downloading files from informer.com with a Python script

For research purposes, I need to build a collection of benign programs. First, I need to obtain these programs from http://downloads.informer.com. To do this, I wrote a Python script that iterates over each download page and extracts the download links into a list. The script then uses those links to download the programs (exe, msi, or zip files). Unfortunately, at this step the script fails with the error (AttributeError: 'Request' object has no attribute 'decode').

Below is a script that, for simplicity, works on a single page and retrieves a single program:

import urllib.request
from urllib.request import urlopen as uReq

import wget
from bs4 import BeautifulSoup as soup

my_url = 'http://sweet-home-3d.informer.com/download'

# the site refuses requests without a browser-like User-Agent
req = urllib.request.Request(
    my_url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

uClient = uReq(req)
page_html = uClient.read()

page_soup = soup(page_html, 'lxml')

# collect the download buttons and take the second one's href
cont01 = page_soup.findAll('a', {'class': 'download_button'})
conts = cont01[1]
ref = conts['href']

addr = urllib.request.Request(
    ref,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0'
    }
)
wget.download(addr)  # this is the line that raises the AttributeError

The error I get is the following:

AttributeError       Traceback (most recent call last) 
<ipython-input-1-93c4caaa1777> in <module>() 
    31  } 
    32) 
---> 33 wget.download(addr) 

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in download(url, out, bar) 
    503 
    504  # get filename for temp file in current directory 
--> 505  prefix = detect_filename(url, out) 
    506  (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".") 
    507  os.close(fd) 

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in detect_filename(url, out, headers, default) 
    482   names["out"] = out or '' 
    483  if url: 
--> 484   names["url"] = filename_from_url(url) or '' 
    485  if headers: 
    486   names["headers"] = filename_from_headers(headers) or '' 

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in filename_from_url(url) 
    228  """:return: detected filename as unicode or None""" 
    229  # [ ] test urlparse behavior with unicode url 
--> 230  fname = os.path.basename(urlparse.urlparse(url).path) 
    231  if len(fname.strip(" \n\t.")) == 0: 
    232   return None 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in urlparse(url, scheme, allow_fragments) 
    292  Note that we don't break the components up in smaller bits 
    293  (e.g. netloc is a single string) and we don't expand % escapes.""" 
--> 294  url, scheme, _coerce_result = _coerce_args(url, scheme) 
    295  splitresult = urlsplit(url, scheme, allow_fragments) 
    296  scheme, netloc, url, query, fragment = splitresult 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in _coerce_args(*args) 
    112  if str_input: 
    113   return args + (_noop,) 
--> 114  return _decode_args(args) + (_encode_result,) 
    115 
    116 # Result objects are more helpful than simple tuples 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in _decode_args(args, encoding, errors) 
    96 def _decode_args(args, encoding=_implicit_encoding, 
    97      errors=_implicit_errors): 
---> 98  return tuple(x.decode(encoding, errors) if x else '' for x in args) 
    99 
    100 def _coerce_args(*args): 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in <genexpr>(.0) 
    96 def _decode_args(args, encoding=_implicit_encoding, 
    97      errors=_implicit_errors): 
---> 98  return tuple(x.decode(encoding, errors) if x else '' for x in args) 
    99 
    100 def _coerce_args(*args): 

AttributeError: 'Request' object has no attribute 'decode' 

I would be grateful if someone could help me solve this problem. Thanks in advance.


Your problem is probably that "addr" is not actually a link to the file but a redirect. Try clicking the link with the Chrome Inspector running (select the "Network" tab) and see where it actually fetches the content from. – jlaur


You are right, the actual link is different. For example, the inspector shows the following link: http://download.informer.com/win-1193020099-a188ca2c-5607e42f/flow_v111_full.zip But when I put it into addr, the same error appears again. And how can I scrape that actual link? – Bander


You could try tracing the final destination. Take a look at this: https://stackoverflow.com/a/20475712/8240959 – jlaur
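
The approach in that link amounts to letting an HTTP client follow the redirect chain and report where it ends up. A minimal sketch of the idea, assuming the requests library (the URL and headers here are placeholders, not from the thread):

import requests

# requests follows redirects by default; r.history records each hop
r = requests.get('http://sweet-home-3d.informer.com/download/',
                 headers={'user-agent': 'Mozilla/5.0'})
for hop in r.history:
    print(hop.status_code, hop.url)
print('final destination:', r.url)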

Answers


Actually, you don't need Selenium. It's a cookie issue. I believe you could also handle the cookies with urllib, but that's not my area of expertise.
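
For reference, a rough sketch of what that cookie handling could look like with only the standard library, via http.cookiejar (untested against this site, so treat it as an assumption):

import urllib.request
from http.cookiejar import CookieJar

# an opener that stores and resends cookies across requests,
# roughly what the requests.Session() below does automatically
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
html = opener.open('http://sweet-home-3d.informer.com/download/').read()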

If you do the work in requests instead (no browser and no wget), you can grab the file like this:

import requests 
from bs4 import BeautifulSoup as bs 

# you need headers or the site won't let you grab the data 
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3181.0 Safari/537.36"
}
url = 'http://sweet-home-3d.informer.com/download/' 

# you need a cookie to download. Create a persistent session
s = requests.Session() 
r = s.get(url, headers=headers) 
soup = bs(r.text, "html.parser") 

# all download options lie in a div with class table 
links_table = soup.find('div', {'class': 'table'}) 
file_name = links_table.find('div', {'class': 'table-cell file_name'})['title'] 
download_link = links_table.find('a', {'class': 'download_button'})['href'] 

# for some reason the main download page doesn't set the cookie you need;
# the subpages do, so we need to get it from one of them before we call download_link

cookie_link = links_table.a['href'] 
r = s.get(cookie_link, headers=headers) 

# now with a cookie set, we can download the file 
r = s.get(download_link, headers=headers)
with open(file_name, 'wb') as f: 
    f.write(r.content) 
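
For large installers it may be safer to stream the body to disk rather than buffer it all in r.content. A small variant of the final step, reusing the session and names from the script above:

# stream the download chunk by chunk instead of holding it in memory
r = s.get(download_link, headers=headers, stream=True)
with open(file_name, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)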

Thanks jlaur, it works fine with your code. – Bander


You're welcome. Please accept the answer to close the question. Leaving it open will lead a lot of other helpful people to visit your question in vain... – jlaur


Wget gives "HTTP Error 503: Service Temporarily Unavailable" when called directly with the correct URL; I suppose it is blocked on the server. The download link is generated by JavaScript, so you can use Selenium, which will execute the JavaScript to obtain the URL. I tried Selenium with PhantomJS, but it didn't work. It does work with Chrome, though.

First, install Selenium:

sudo pip3 install selenium 

Then get the driver from https://sites.google.com/a/chromium.org/chromedriver/downloads and put it on your PATH. If you (unlike me) are on Windows or Mac, you can use a headless "Chrome Canary" build.
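
If you try the headless route, enabling it would look roughly like this (a sketch, assuming Chrome 59+ with a matching chromedriver; newer Selenium releases spell the keyword options= instead of chrome_options=):

from selenium import webdriver

# run Chrome without a visible window; note that early headless Chrome
# blocked file downloads unless explicitly enabled, so test this first
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=options)

The script below drives a regular, visible Chrome window instead: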

from selenium import webdriver 
from time import sleep 

url = 'http://sweet-home-3d.informer.com/download' 
browser = webdriver.Chrome() 
browser.get(url) 
browser.find_element_by_class_name("download_btn").click() 
sleep(360) # give it plenty of time to download; this will depend on your internet connection
browser.quit() 

The file will be downloaded to your Downloads folder. If you quit too early, you will get a partial file with the extension .crdownload. If that happens, increase the value you pass to sleep.
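
Rather than guessing a sleep value, you could poll the download folder until Chrome's .crdownload placeholder disappears. A hedged sketch, where the folder path is an assumption about the default Chrome setup:

import glob
import os
import time

download_dir = os.path.expanduser('~/Downloads')  # assumed default location

def wait_for_downloads(timeout=600, poll=2):
    # return True once no *.crdownload placeholder remains, False on timeout
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not glob.glob(os.path.join(download_dir, '*.crdownload')):
            return True
        time.sleep(poll)
    return False

Call wait_for_downloads() in place of the fixed sleep(360) before browser.quit().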