Scraping and downloading files from informer.com with a Python script

For research purposes, I need to build a collection of benign programs. First, I need to obtain these programs from http://downloads.informer.com. To do this, I wrote a Python script that iterates over each download page and extracts the download links into a list. The script then uses those links to download the programs (exe, msi, or zip files). Unfortunately, at this step the script fails with the error (AttributeError: 'Request' object has no attribute 'decode').

Below is a script that, for simplicity, works on a single page and retrieves a single program:

import urllib.request
from urllib.request import urlopen as uReq

import wget
from bs4 import BeautifulSoup as soup

my_url = 'http://sweet-home-3d.informer.com/download'

# the site refuses requests without a browser-like User-Agent
req = urllib.request.Request(
    my_url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

uClient = uReq(req)
page_html = uClient.read()

page_soup = soup(page_html, 'lxml')

# collect the download buttons and take the second one's href
cont01 = page_soup.findAll('a', {'class': 'download_button'})
conts = cont01[1]
ref = conts['href']

addr = urllib.request.Request(
    ref,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0'
    }
)
wget.download(addr)  # this is the line that raises the AttributeError

The error I get is the following:

AttributeError       Traceback (most recent call last) 
<ipython-input-1-93c4caaa1777> in <module>() 
    31  } 
    32) 
---> 33 wget.download(addr) 

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in download(url, out, bar) 
    503 
    504  # get filename for temp file in current directory 
--> 505  prefix = detect_filename(url, out) 
    506  (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".") 
    507  os.close(fd) 

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in detect_filename(url, out, headers, default) 
    482   names["out"] = out or '' 
    483  if url: 
--> 484   names["url"] = filename_from_url(url) or '' 
    485  if headers: 
    486   names["headers"] = filename_from_headers(headers) or '' 

C:\Users\bander\Anaconda3\lib\site-packages\wget.py in filename_from_url(url) 
    228  """:return: detected filename as unicode or None""" 
    229  # [ ] test urlparse behavior with unicode url 
--> 230  fname = os.path.basename(urlparse.urlparse(url).path) 
    231  if len(fname.strip(" \n\t.")) == 0: 
    232   return None 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in urlparse(url, scheme, allow_fragments) 
    292  Note that we don't break the components up in smaller bits 
    293  (e.g. netloc is a single string) and we don't expand % escapes.""" 
--> 294  url, scheme, _coerce_result = _coerce_args(url, scheme) 
    295  splitresult = urlsplit(url, scheme, allow_fragments) 
    296  scheme, netloc, url, query, fragment = splitresult 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in _coerce_args(*args) 
    112  if str_input: 
    113   return args + (_noop,) 
--> 114  return _decode_args(args) + (_encode_result,) 
    115 
    116 # Result objects are more helpful than simple tuples 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in _decode_args(args, encoding, errors) 
    96 def _decode_args(args, encoding=_implicit_encoding, 
    97      errors=_implicit_errors): 
---> 98  return tuple(x.decode(encoding, errors) if x else '' for x in args) 
    99 
    100 def _coerce_args(*args): 

C:\Users\bander\Anaconda3\lib\urllib\parse.py in <genexpr>(.0) 
    96 def _decode_args(args, encoding=_implicit_encoding, 
    97      errors=_implicit_errors): 
---> 98  return tuple(x.decode(encoding, errors) if x else '' for x in args) 
    99 
    100 def _coerce_args(*args): 

AttributeError: 'Request' object has no attribute 'decode' 

I would be grateful if someone could help me solve this problem. Thanks in advance.


Your problem is probably that "addr" is not actually a link to the file but a redirect. Try clicking the link with the Chrome Inspector running (select the "Network" tab) and see where it actually fetches the content from. – jlaur


You are right, the actual link is different. For example, the inspector shows the following link: http://download.informer.com/win-1193020099-a188ca2c-5607e42f/flow_v111_full.zip But when I put it into addr, the same error appears again. And how can I scrape that actual link? – Bander


You could try tracing the final destination. Take a look at this: https://stackoverflow.com/a/20475712/8240959 – jlaur
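
The approach in that link amounts to letting an HTTP client follow the redirect chain and report where it ends up. A minimal sketch of the idea, assuming the requests library (the URL and headers here are placeholders, not from the thread):

import requests

# requests follows redirects by default; r.history records each hop
r = requests.get('http://sweet-home-3d.informer.com/download/',
                 headers={'user-agent': 'Mozilla/5.0'})
for hop in r.history:
    print(hop.status_code, hop.url)
print('final destination:', r.url)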

Answers


Actually, you don't need Selenium. It's a cookie issue. I believe you could also handle the cookies with urllib, but that's not my area of expertise.
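
For reference, a rough sketch of what that cookie handling could look like with only the standard library, via http.cookiejar (untested against this site, so treat it as an assumption):

import urllib.request
from http.cookiejar import CookieJar

# an opener that stores and resends cookies across requests,
# roughly what the requests.Session() below does automatically
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
html = opener.open('http://sweet-home-3d.informer.com/download/').read()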

If you do the work in requests instead (no browser and no wget), you can grab the file like this:

import requests 
from bs4 import BeautifulSoup as bs 

# you need headers or the site won't let you grab the data 
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3181.0 Safari/537.36"
}
url = 'http://sweet-home-3d.informer.com/download/' 

# you need a cookie to download. Create a persistent session
s = requests.Session() 
r = s.get(url, headers=headers) 
soup = bs(r.text, "html.parser") 

# all download options lie in a div with class table 
links_table = soup.find('div', {'class': 'table'}) 
file_name = links_table.find('div', {'class': 'table-cell file_name'})['title'] 
download_link = links_table.find('a', {'class': 'download_button'})['href'] 

# for some reason the main download page doesn't set the cookie you need;
# the subpages do, so we need to get it from one of them before we call download_link

cookie_link = links_table.a['href'] 
r = s.get(cookie_link, headers=headers) 

# now with a cookie set, we can download the file 
r = s.get(download_link, headers=headers)
with open(file_name, 'wb') as f: 
    f.write(r.content) 
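
For large installers it may be safer to stream the body to disk rather than buffer it all in r.content. A small variant of the final step, reusing the session and names from the script above:

# stream the download chunk by chunk instead of holding it in memory
r = s.get(download_link, headers=headers, stream=True)
with open(file_name, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)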

Thanks jlaur, it works fine with your code. – Bander


You're welcome. Please accept the answer to close the question. Leaving it open will lead a lot of other helpful people to visit your question in vain... – jlaur


Wget gives "HTTP Error 503: Service Temporarily Unavailable" when called directly with the correct URL; I suppose it is blocked on the server. The download link is generated by JavaScript, so you can use Selenium, which will execute the JavaScript to obtain the URL. I tried Selenium with PhantomJS, but it didn't work. It does work with Chrome, though.

First, install Selenium:

sudo pip3 install selenium 

Then get the driver from https://sites.google.com/a/chromium.org/chromedriver/downloads and put it on your PATH. If you (unlike me) are on Windows or Mac, you can use a headless "Chrome Canary" build.
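
If you try the headless route, enabling it would look roughly like this (a sketch, assuming Chrome 59+ with a matching chromedriver; newer Selenium releases spell the keyword options= instead of chrome_options=):

from selenium import webdriver

# run Chrome without a visible window; note that early headless Chrome
# blocked file downloads unless explicitly enabled, so test this first
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=options)

The script below drives a regular, visible Chrome window instead: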

from selenium import webdriver 
from time import sleep 

url = 'http://sweet-home-3d.informer.com/download' 
browser = webdriver.Chrome() 
browser.get(url) 
browser.find_element_by_class_name("download_btn").click() 
sleep(360) # give it plenty of time to download; this will depend on your internet connection
browser.quit() 

The file will be downloaded to your Downloads folder. If you quit too early, you will get a partial file with the extension .crdownload. If that happens, increase the value you pass to sleep.
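
Rather than guessing a sleep value, you could poll the download folder until Chrome's .crdownload placeholder disappears. A hedged sketch, where the folder path is an assumption about the default Chrome setup:

import glob
import os
import time

download_dir = os.path.expanduser('~/Downloads')  # assumed default location

def wait_for_downloads(timeout=600, poll=2):
    # return True once no *.crdownload placeholder remains, False on timeout
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not glob.glob(os.path.join(download_dir, '*.crdownload')):
            return True
        time.sleep(poll)
    return False

Call wait_for_downloads() in place of the fixed sleep(360) before browser.quit().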