
I want to crawl multiple websites using Scrapy's link extractor with follow=True (recursive), and I am looking for a way to set a time limit for crawling each URL in the start_urls list. How can I set a per-URL time limit in Scrapy?

Thanks

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
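
The question mentions crawling recursively with a link extractor and follow=True. As a sketch of what that setup looks like (my illustration, not the asker's actual code; it reuses DmozItem from above and the spider name is hypothetical):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DmozCrawlSpider(CrawlSpider):
    name = "dmoz_crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]
    # follow=True recurses into every link the extractor finds on each page
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = DmozItem()  # reuses the item class defined above
        item['title'] = response.xpath('//title/text()').extract()
        item['link'] = [response.url]
        yield item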

Answers

-1

Use a timeout object!

import signal

class Timeout(object):
    """Timeout class using ALARM signal."""

    class TimeoutError(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)  # disable the alarm

    def raise_timeout(self, *args):
        raise Timeout.TimeoutError('TimeoutError')

Then you can call the extraction inside a with statement, like this:

try:
    with Timeout(10):  # allow at most 10 seconds
        do_what_you_need_to_do()  # placeholder for the code you want to limit
except Timeout.TimeoutError:
    pass  # break, continue, or whatever else you may need
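
For example, a minimal self-contained run (my sketch, assuming the Timeout class above is already defined; time.sleep stands in for the real work, and note that SIGALRM only works on Unix and in the main thread):

import time

try:
    with Timeout(2):   # allow at most 2 seconds
        time.sleep(5)  # stand-in for slow work; the alarm interrupts the sleep
    print('finished in time')
except Timeout.TimeoutError:
    print('timed out')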

Can you share an example using scrapy? Thanks – gsuresh92


It has nothing to do with scrapy; you can use it for anything you want. Just put the function call (or code snippet) whose running time you need to limit inside the try/except block and you are done. –


If you post your code, I can show you. –

0

You need to use the download_timeout meta key of scrapy.Request.

To use it for the start URLs, you need to override the start_requests(self) method, something like this:

from scrapy import Request  # at the top of your spider module

def start_requests(self):
    # 10 seconds for the first url
    yield Request(self.start_urls[0], meta={'download_timeout': 10})
    # 60 seconds for the second url
    yield Request(self.start_urls[1], meta={'download_timeout': 60})

You can read more about the special Request meta keys here: http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys
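
A slightly fuller sketch of this approach (my illustration, not part of the original answer; the timeout values, spider name, and the on_error errback name are assumptions), pairing each start URL with its own download timeout and logging timed-out requests instead of letting them fail silently:

from scrapy import Spider, Request
from twisted.internet.error import TimeoutError

class PerUrlTimeoutSpider(Spider):
    name = "per_url_timeout"
    # (url, per-request download timeout in seconds)
    url_timeouts = [
        ("http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", 10),
        ("http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/", 60),
    ]

    def start_requests(self):
        for url, timeout in self.url_timeouts:
            yield Request(url, meta={'download_timeout': timeout},
                          callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)

    def on_error(self, failure):
        # a download that exceeds its timeout arrives here as a twisted TimeoutError
        if failure.check(TimeoutError):
            self.logger.warning("download timed out: %s", failure.request.url)

Note that download_timeout is the per-request counterpart of the project-wide DOWNLOAD_TIMEOUT setting (180 seconds by default): it caps the time spent downloading a single response, not the total time spent recursively crawling everything reachable from that start URL.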
