
Conflict when generating start URLs

I am retrieving information from the National Gallery of Art's online catalog. Because of the way the catalog is structured, I cannot navigate from entry to entry by extracting and following links. Fortunately, each object in the collection has a predictable URL, so I want my spider to navigate the collection by generating start URLs.

I tried to resolve my problem by implementing the solution from this thread. Unfortunately, it seems to break another part of my spider. The error log shows that my URLs are generated successfully but are not processed correctly. If I am interpreting the log correctly, and I suspect I am not, there is a conflict between the redefinition of start_urls that lets me generate the URLs I need and the rules section of the spider. As things stand, the spider also ignores the number of pages I asked it to crawl.

You will find my spider and a typical error below. I appreciate any help you can offer.

Spider:

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d" 
starting_number = 1312 
number_of_pages = 10 
class NGASpider(CrawlSpider): 
    name = 'ngamedallions' 
    allowed_domains = ['nga.gov'] 
    start_urls = [URL % starting_number] 
    rules = (
      Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord', 
follow=True)) 

    def __init__(self): 
     self.page_number = starting_number 

    def start_requests(self): 
     for i in range (self.page_number, number_of_pages, -1): 
      yield Request(url = URL % i + ".html" , callback=self.parse) 


    def parse_CatalogRecord(self, response): 
     CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response) 
     CatalogRecord.default_output_processor = TakeFirst() 
     CatalogRecord.image_urls_out = scrapy.loader.processors.Identity() 
     keywords = "medal|medallion" 
     r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE) 
     if r.search(response.body_as_unicode()): 

      CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()') 
      CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()') 
      CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()') 
      CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src') 

      return CatalogRecord.load_item() 

Typical error:

2016-04-29 15:35:00 [scrapy] ERROR: Spider error processing <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1178.html> (referer: None) 
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 73, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 51, in _requests_to_follow
    for n, rule in enumerate(self._rules):
AttributeError: 'NGASpider' object has no attribute '_rules'

Update in response to eLRuLL's solution

Simply removing __init__ and start_urls lets my spider crawl the URLs I generate. However, it also seems to prevent 'def parse_CatalogRecord(self, response)' from being applied. When I run the spider now, it only scrapes pages outside the range of URLs I generate. My revised spider and the log output follow.

Spider:

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d" 
starting_number = 1312 
number_of_pages = 1311 
class NGASpider(CrawlSpider): 
    name = 'ngamedallions' 
    allowed_domains = ['nga.gov'] 
    rules = (
      Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord', 
follow=True)) 

    def start_requests(self): 
     self.page_number = starting_number 
     for i in range (self.page_number, number_of_pages, -1): 
      yield Request(url = URL % i + ".html" , callback=self.parse) 


    def parse_CatalogRecord(self, response): 
     CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response) 
     CatalogRecord.default_output_processor = TakeFirst() 
     CatalogRecord.image_urls_out = scrapy.loader.processors.Identity() 
     keywords = "medal|medallion" 
     r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE) 
     if r.search(response.body_as_unicode()): 

      CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()') 
      CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()') 
      CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()') 
      CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src') 

      return CatalogRecord.load_item() 
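A note on the behaviour described above: start_requests hands the generated pages to CrawlSpider's built-in parse, which only applies the rules to extract and follow links and never runs a rule callback on the start response itself, so only linked pages reach parse_CatalogRecord. A hypothetical sketch, not part of the spider as run, of routing start responses through the same callback via CrawlSpider's parse_start_url hook:

    # Hypothetical addition: CrawlSpider calls parse_start_url() on each
    # start URL response, so delegating to parse_CatalogRecord would let
    # the generated pages themselves be scraped as well.
    def parse_start_url(self, response):
        return self.parse_CatalogRecord(response)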

Log:

2016-05-02 15:50:02 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions) 
2016-05-02 15:50:02 [scrapy] INFO: Optional features available: ssl, http11 
2016-05-02 15:50:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3} 
2016-05-02 15:50:02 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-05-02 15:50:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-05-02 15:50:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-05-02 15:50:02 [scrapy] INFO: Enabled item pipelines: ImagesPipeline 
2016-05-02 15:50:02 [scrapy] INFO: Spider opened 
2016-05-02 15:50:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-05-02 15:50:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-05-02 15:50:02 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None) 
2016-05-02 15:50:02 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2016-05-02 15:50:05 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html) 
2016-05-02 15:50:05 [scrapy] DEBUG: File (uptodate): Downloaded image from <GET http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg> referred in <None> 
2016-05-02 15:50:05 [scrapy] DEBUG: Scraped from <200 http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> 
{'accession': u'1942.9.163.b', 
'image_urls': [u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'], 
'images': [{'checksum': '9d5f2e30230aeec1582ca087bcde6bfa', 
     'path': 'full/3a692347183d26ffefe9ba0af80b0b6bf247fae5.jpg', 
     'url': 'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'}], 
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA', 
'title': u'House between Two Hills [reverse]'} 
2016-05-02 15:50:05 [scrapy] INFO: Closing spider (finished) 
2016-05-02 15:50:05 [scrapy] INFO: Stored json feed (1 items) in: items.json 
2016-05-02 15:50:05 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 631, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 26324, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'dupefilter/filtered': 3, 
'file_count': 1, 
'file_status_count/uptodate': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 5, 2, 19, 50, 5, 810570), 
'item_scraped_count': 1, 
'log_count/DEBUG': 6, 
'log_count/INFO': 8, 
'request_depth_max': 2, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2016, 5, 2, 19, 50, 2, 455508)} 
2016-05-02 15:50:05 [scrapy] INFO: Spider closed (finished) 

Answer


Don't override the __init__ method if you aren't going to call super().
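If an __init__ really is needed, a minimal sketch of one that keeps CrawlSpider's setup intact by calling super first (the page_number attribute is carried over from the question and only illustrative):

    def __init__(self, *args, **kwargs):
        # Calling super() first lets CrawlSpider compile self._rules,
        # the attribute the traceback above reports as missing.
        super(NGASpider, self).__init__(*args, **kwargs)
        self.page_number = starting_number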

Now, if you are going to use start_requests, you don't need to declare start_urls for your spider to work.

Just remove your __init__ method, and there is no need for start_urls to exist.

UPDATE

OK, my mistake, it looks like CrawlSpider needs the start_urls attribute, so just create it instead of using the start_requests method:

start_urls = [URL % i + '.html' for i in range(starting_number, number_of_pages, -1)] 

and remove start_requests.
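For reference, a minimal sketch of how the top of the spider might look with that change; the parse_CatalogRecord method and the module-level constants are unchanged:

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    # Generated start URLs replace start_requests; CrawlSpider's default
    # parse callback then applies the rules to every response it fetches.
    start_urls = [URL % i + '.html'
                  for i in range(starting_number, number_of_pages, -1)]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord', follow=True),
    )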


I think your solution works in that it allows my spider to crawl all of the generated URLs, but it also creates a second problem. When the spider reports that it crawled a generated URL, it does not seem to apply my parse method unless it follows a link to a secondary page. – Tric


Please check the updated answer – eLRuLL


Works perfectly, thanks! – Tric