Scrapy returns 403 error (Forbidden)

I'm quite new to Scrapy and to using Python in general. In the past I managed to get a minimal Scrapy example working, but I haven't used it since. In the meantime a new version has come out (I believe the version I used last time was 0.24), and for the life of me I can't figure out why I get a 403 error no matter which site I try to crawl. Admittedly I haven't dug into middlewares and/or pipelines yet, but I was hoping to get the minimal example running before exploring any further. That said, here is my current code:

items.py

import scrapy 

class StackItem(scrapy.Item): 
    title = scrapy.Field() 
    url = scrapy.Field() 

stack_spider.py

#derived from https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/ 
from scrapy import Spider 
from scrapy.selector import Selector 
from stack.items import StackItem 

class StackSpider(Spider):
    handle_httpstatus_list = [403, 404]  # kind of out of desperation. Is it serving any purpose?
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            self.log(question)
            item = StackItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            yield item

Output

(pyplayground) 22:39 ~/stack $ scrapy crawl stack                                
2016-03-07 22:39:38 [scrapy] INFO: Scrapy 1.0.5 started (bot: stack)                           
2016-03-07 22:39:38 [scrapy] INFO: Optional features available: ssl, http11                         
2016-03-07 22:39:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'stack', 'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 403, 404, 408], 'DOWNLOAD_DELAY': 3}
2016-03-07 22:39:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState               
2016-03-07 22:39:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-07 22:39:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware     
2016-03-07 22:39:39 [scrapy] INFO: Enabled item pipelines:                              
2016-03-07 22:39:39 [scrapy] INFO: Spider opened                                
2016-03-07 22:39:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)                   
2016-03-07 22:39:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023                         
2016-03-07 22:39:39 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 1 times): 403 Forbidden         
2016-03-07 22:39:42 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 2 times): 403 Forbidden         
2016-03-07 22:39:47 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 3 times): 403 Forbidden         
2016-03-07 22:39:51 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 4 times): 403 Forbidden         
2016-03-07 22:39:55 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 5 times): 403 Forbidden         
2016-03-07 22:39:58 [scrapy] DEBUG: Gave up retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 6 times): 403 Forbidden       
2016-03-07 22:39:58 [scrapy] DEBUG: Crawled (403) <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (referer: None)            
2016-03-07 22:39:58 [scrapy] INFO: Closing spider (finished)                             
2016-03-07 22:39:58 [scrapy] INFO: Dumping Scrapy stats:                              
{'downloader/request_bytes': 1488,                                    
'downloader/request_count': 6,                                    
'downloader/request_method_count/GET': 6,                                  
'downloader/response_bytes': 6624,                                   
'downloader/response_count': 6,                                    
'downloader/response_status_count/403': 6,                                 
'finish_reason': 'finished',                                     
'finish_time': datetime.datetime(2016, 3, 7, 22, 39, 58, 458578),                            
'log_count/DEBUG': 8,                                       
'log_count/INFO': 7,                                       
'response_received_count': 1,                                     
'scheduler/dequeued': 6,                                      
'scheduler/dequeued/memory': 6,                                    
'scheduler/enqueued': 6,                                      
'scheduler/enqueued/memory': 6,                                    
'start_time': datetime.datetime(2016, 3, 7, 22, 39, 39, 607472)}                            
2016-03-07 22:39:58 [scrapy] INFO: Spider closed (finished) 

What version are you using, and is it just that one site you're having problems with? –


Oh, sorry. Forgot to mention that. I'm using Scrapy '1.0.5' on Ubuntu ('Linux 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux') – w00t


Try adding a user agent in your settings file, something like USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7' – Rahul
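
Following Rahul's suggestion, a minimal sketch of what that could look like in the project's settings.py is shown below. The bot name, module names, delay and retry count are taken from the "Overridden settings" line in the log above; the user-agent string itself is just an example browser string, not the only one that will work.

settings.py (sketch)

# Project settings, mirroring the values visible in the log above.
BOT_NAME = 'stack'
SPIDER_MODULES = ['stack.spiders']
NEWSPIDER_MODULE = 'stack.spiders'
DOWNLOAD_DELAY = 3
RETRY_TIMES = 5

# Send a browser-like user agent instead of Scrapy's default one.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 '
              '(KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7')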

Answer


Most likely you are behind a proxy. Check and set your http_proxy and https_proxy environment variables appropriately. Cross-check with curl to see whether you can fetch that URL from the terminal at all.
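
To make that cross-check concrete, here is a small sketch (not part of the original answer) that prints the proxy environment variables and tries to fetch the start URL outside Scrapy, much like the suggested curl test. It assumes Python 3; on Python 2 the equivalent module would be urllib2, and the file name check_proxy.py is just a placeholder.

check_proxy.py (sketch)

import os
import urllib.error
import urllib.request

URL = "http://stackoverflow.com/questions?pagesize=50&sort=newest"

# Both urllib and Scrapy's HttpProxyMiddleware pick up these environment variables.
print("http_proxy  =", os.environ.get("http_proxy"))
print("https_proxy =", os.environ.get("https_proxy"))

request = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
try:
    response = urllib.request.urlopen(request, timeout=10)
    print("Fetched OK, status:", response.getcode())
except urllib.error.HTTPError as err:
    # A 403 here as well would point at the proxy/network rather than at Scrapy.
    print("HTTP error:", err.code)
except urllib.error.URLError as err:
    print("Could not reach the URL:", err.reason)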


You are absolutely right. It turns out I was working in a remote environment and had completely forgotten about that detail. Trying cURL is what I should have done in the first place. – w00t