Scrapy: populating an item with an item loader across multiple pages

2016-09-24

I'm trying to crawl and scrape multiple pages, given multiple URLs. I'm testing with Wikipedia, and to make things easier I used the same XPath selector for every page, but eventually I want to use many different XPath selectors unique to each page, so each page has its own separate parsePage method.

This code works perfectly when I don't use an item loader and populate the item directly. When I use an item loader, the items are populated strangely, and it seems to completely ignore the callbacks assigned in the parse method and to use only the start_urls for the parsePage methods.

import scrapy
from scrapy import Request
from testanother.items import TestItems, TheLoader

class tester(scrapy.Spider):
    name = 'vs'
    handle_httpstatus_list = [404, 200, 300]
    # Usually, I only get data from the first start url
    start_urls = [
        'https://en.wikipedia.org/wiki/SANZAAR',
        'https://en.wikipedia.org/wiki/2016_Rugby_Championship',
        'https://en.wikipedia.org/wiki/2016_Super_Rugby_season',
    ]

    def parse(self, response):
        # item = TestItems()
        l = TheLoader(item=TestItems(), response=response)
        # When I use an item loader, the url in the request is completely
        # ignored. Without the item loader, it works properly.
        request = Request("https://en.wikipedia.org/wiki/2016_Rugby_Championship",
                          callback=self.parsePage1, meta={'loadernext': l}, dont_filter=True)
        yield request

        request = Request("https://en.wikipedia.org/wiki/SANZAAR",
                          callback=self.parsePage2, meta={'loadernext1': l}, dont_filter=True)
        yield request

        yield Request("https://en.wikipedia.org/wiki/2016_Super_Rugby_season",
                      callback=self.parsePage3, meta={'loadernext2': l}, dont_filter=True)

    def parsePage1(self, response):
        loadernext = response.meta['loadernext']
        loadernext.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        return loadernext.load_item()
    # I'm not sure if this return and load_item is the problem, because I've
    # tried yielding/returning to another method that does the item loading
    # instead and the first start url is still the only url scraped.

    def parsePage2(self, response):
        loadernext1 = response.meta['loadernext1']
        loadernext1.add_xpath('title2', '//*[@id="firstHeading"]/text()')
        return loadernext1.load_item()

    def parsePage3(self, response):
        loadernext2 = response.meta['loadernext2']
        loadernext2.add_xpath('title3', '//*[@id="firstHeading"]/text()')
        return loadernext2.load_item()

Here is the result when I don't use the item loader:

{'title1': [u'2016 Rugby Championship'], 
'title': [u'SANZAAR'], 
'title3': [u'2016 Super Rugby season']} 

Here is part of the log when I do use the item loader:

{'title2': u'SANZAAR'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/SANZAAR) 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship) 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season> 
{'title2': u'SANZAAR', 'title3': u'SANZAAR'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/SANZAAR> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship) 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season) 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Super_Rugby_season> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship) 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Super_Rugby_season> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season) 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship> 
{'title1': u'SANZAAR', 'title2': u'SANZAAR', 'title3': u'SANZAAR'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship> 
{'title1': u'2016 Rugby Championship'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/SANZAAR> 
{'title1': u'2016 Rugby Championship', 'title2': u'2016 Rugby Championship'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship> 
{'title1': u'2016 Super Rugby season'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/SANZAAR> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season) 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season> 
{'title1': u'2016 Rugby Championship', 
'title2': u'2016 Rugby Championship', 
'title3': u'2016 Rugby Championship'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season> 
{'title1': u'2016 Super Rugby season', 'title3': u'2016 Super Rugby season'} 
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/SANZAAR> 
{'title1': u'2016 Super Rugby season', 
'title2': u'2016 Super Rugby season', 
'title3': u'2016 Super Rugby season'} 
2016-09-24 14:30:43 [scrapy] INFO: Clos 

What exactly is going wrong here? Thanks!

Answer

One problem is that you are passing references to the same item loader instance into multiple callbacks; for example, there are two `yield request` instructions in `parse`. In addition, in the subsequent callbacks the loader still operates on the old `response` object: in `parsePage1`, the item loader is still working with the `response` from `parse`.
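This matches the log above: every field ends up holding the heading of whichever response the loader was built from. The failure mode can be illustrated without Scrapy at all; `FakeLoader` below is a toy stand-in, not a Scrapy class, and it keeps only the two properties that matter here: the response is frozen at construction time, and all callbacks write into the same underlying dict.

```python
# Plain-Python analogy (not Scrapy) for why one shared loader misbehaves:
# the loader keeps the response it was built with, and every callback
# writes into the same shared accumulator.

class FakeLoader:
    def __init__(self, response):
        self.response = response   # frozen when the loader is created
        self.values = {}           # shared accumulator across callbacks

    def add(self, field):
        # Reads from the ORIGINAL response, not the page the callback
        # was actually invoked for.
        self.values[field] = self.response

loader = FakeLoader(response="SANZAAR page")  # created once, in parse()
loader.add("title1")   # "parsePage1" still sees the SANZAAR response
loader.add("title2")   # "parsePage2": same stale response, same dict

print(loader.values)
# {'title1': 'SANZAAR page', 'title2': 'SANZAAR page'}
```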

In most cases, passing an item loader to another callback is not recommended. Instead, you may find it better to pass the item object directly.

Here is a brief (and incomplete) example, based on your code:

def parse(self, response):
    l = TheLoader(item=TestItems(), response=response)
    request = Request(
        "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
        callback=self.parsePage1,
        meta={'item': l.load_item()},
        dont_filter=True,
    )
    yield request

def parsePage1(self, response):
    loadernext = TheLoader(item=response.meta['item'], response=response)
    loadernext.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    return loadernext.load_item()
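If the goal is a single item combining all three headings, the same pattern extends by chaining the requests sequentially (each callback wraps the partial item in a fresh loader and yields the next request) rather than fanning them out in parallel with a shared loader. Stripped of Scrapy itself, the data flow can be sketched in plain Python; `PAGES` and `parse_page` below are stand-ins for the real responses and callbacks, not Scrapy APIs:

```python
# Stand-in for the firstHeading text of the three Wikipedia pages.
PAGES = {
    "https://en.wikipedia.org/wiki/2016_Rugby_Championship": "2016 Rugby Championship",
    "https://en.wikipedia.org/wiki/SANZAAR": "SANZAAR",
    "https://en.wikipedia.org/wiki/2016_Super_Rugby_season": "2016 Super Rugby season",
}

def parse_page(url, field, item):
    """One 'callback': merge this page's heading into the partial item."""
    merged = dict(item)          # copy, like wrapping the item in a new loader
    merged[field] = PAGES[url]   # stand-in for add_xpath on firstHeading
    return merged

# Chain: each step receives the item built so far, like meta={'item': ...}.
item = {}
item = parse_page("https://en.wikipedia.org/wiki/2016_Rugby_Championship", "title1", item)
item = parse_page("https://en.wikipedia.org/wiki/SANZAAR", "title2", item)
item = parse_page("https://en.wikipedia.org/wiki/2016_Super_Rugby_season", "title3", item)

print(item)
# {'title1': '2016 Rugby Championship', 'title2': 'SANZAAR', 'title3': '2016 Super Rugby season'}
```

Because each step copies the partial item and reads from its own "response", every field gets the heading of the correct page, unlike the shared-loader version.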
Thank you so much! This solved my problem. –

Your awesome answer forced me to log in to Stackoverflow to upvote you :) – mango