2017-01-02 147 views

Below is my code to scrape a website using Beautiful Soup. The code runs fine on Windows, but on Ubuntu it sometimes runs and sometimes gives an error.

The error is:

Traceback (most recent call last): 
  File "Craftsvilla.py", line 22, in <module> 
    source = requests.get(new_url) 
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get 
    return request('get', url, params=params, **kwargs) 
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request 
    return session.request(method=method, url=url, **kwargs) 
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request 
    resp = self.send(prep, **send_kwargs) 
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send 
    r = adapter.send(request, **kwargs) 
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send 
    raise ConnectionError(e, request=request) 
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.craftsvilla.com', port=80): Max retries exceeded with url: /shop/01-princess-ayesha-cotton-salwar-suit-for-rudra-house/5601472 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f6685fc3310>: Failed to establish a new connection: [Errno -2] Name or service not known',)) 

Here is my code:

import requests 
import lxml 
from bs4 import BeautifulSoup 
import xlrd 
import xlwt 

file_location = "/home/nitink/Python Linux/BeautifulSoup/Craftsvilla/Craftsvilla.xlsx" 

workbook = xlrd.open_workbook(file_location) 

sheet = workbook.sheet_by_index(0) 

products = [] 
for r in range(sheet.nrows): 
    products.append(sheet.cell_value(r,0)) 

book = xlwt.Workbook(encoding="utf-8", style_compression=0) 
sheet = book.add_sheet("Sheet11", cell_overwrite_ok=True) 

for index, url in enumerate(products): 
    new_url = "http://www." + url 
    source = requests.get(new_url) 
    data = source.content 
    soup = BeautifulSoup(data, "lxml") 

    sheet.write(index, 0, url) 

    try: 
        Product_Name = soup.select(".product-title")[0].text.strip() 
        sheet.write(index, 1, Product_Name) 
    except Exception: 
        sheet.write(index, 1, "") 

book.save("Craftsvilla Output.xls") 

Save the following links as Craftsvilla.xlsx:

craftsvilla.com/shop/01-princess-ayesha-cotton-salwar-suit-for-rudra-house/5601472 
craftsvilla.com/shop/3031-pista-prachi/3715170 
craftsvilla.com/shop/795-peach-colored-stright-salwar-suit/5608295 
craftsvilla.com/catalog/product/view/id/5083511/s/dharm-fashion-villa-embroidery-navy-blue-slawar-suit-gown 

Note: Sometimes the code will run, but try it a few times and the same code will give the error. Not sure why. And the same code never gives any error on Windows.


I think you are sending too many requests from the same IP address in a short time, so the server may be refusing your connection. –


But why doesn't the same code give the error on Windows? – Nitin


Add 'print(new_url)' after the 'new_url' line; I think you are reading the xlsx file and getting incomplete data. –
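The debugging suggestion above can be sketched as follows. The `products` list here is a stand-in sample (in the real script it is read from the xlsx file with xlrd); stripping whitespace guards against stray spaces in spreadsheet cells, which would make the hostname unresolvable:

```python
# Sanity-check the URLs built from the spreadsheet before requesting them.
# Sample data standing in for the values read via xlrd.
products = [
    "craftsvilla.com/shop/3031-pista-prachi/3715170",
    "  craftsvilla.com/shop/795-peach-colored-stright-salwar-suit/5608295  ",
]

for index, url in enumerate(products):
    new_url = "http://www." + url.strip()  # strip stray whitespace from cells
    print(index, new_url)
```

If any printed URL looks truncated or malformed, the problem is in the spreadsheet data rather than in the network.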

Answer


It looks like you are hitting the site too often and the server is refusing your requests. Be a good web-scraping citizen and add a time delay between subsequent requests:

import time 

for index, url in enumerate(products): 
    new_url = "http://www." + url 
    source = requests.get(new_url) 
    data = source.content 
    soup = BeautifulSoup(data, "lxml") 

    # ... 

    time.sleep(1) # one second delay
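A fixed delay helps with rate limiting, but the `[Errno -2] Name or service not known` in the traceback is a failed DNS lookup, which can also be transient; retrying the request a few times can ride out such hiccups. A minimal sketch of a retry helper (the name `fetch_with_retry` is illustrative, not part of the original code; requests' `ConnectionError` subclasses `IOError`, so catching `IOError` covers it without importing requests here):

```python
import time

def fetch_with_retry(get, url, attempts=3, delay=2):
    """Call get(url), retrying on connection errors with a pause between tries.

    `get` is any callable that fetches a URL (e.g. requests.get).
    Re-raises the last error if every attempt fails.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return get(url)
        except IOError as err:  # requests.ConnectionError is an IOError
            last_err = err
            time.sleep(delay)
    raise last_err
```

In the scraping loop this would replace the bare call: `source = fetch_with_retry(requests.get, new_url)`.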