How to crawl and scrape one set of data from multiple linked pages with Scrapy

What I am trying to do is scrape company information (thisisavailable.eu.pn/company.html) and add to the board dict all of the board members, with their respective data scraped from separate pages.

So ideally, the data I would get back from the sample page would be:

{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "[email protected]",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}

I have searched Google and SO (e.g. here, here, the Scrapy docs, etc.) but have not been able to find a solution to a problem exactly like this.

What I have managed to put together so far:

items.py:

import scrapy


class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()


class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()

spiders/example.py:

import scrapy
from proov.items import company_item, person_item


class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)

    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []

        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print(person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/' + person_row.xpath("a/@href").extract_first(), callback=self.parse_person)
            request.meta['Person'] = Person
            return request
            board.append(Person)

        Company['board'] = board
        return Company

    def parse_person(self, response):
        print('PERSON!!!!!!!!!!!')
        print(response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person

UPDATE: Rafael spotted that the initial problem was the allowed_domains being wrong - I have commented it out for now, and when I run it I get (asterisks added to URLs due to low rep):

scrapy crawl example

2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://*thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/person2.html> (referer: http://*thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
2017-03-07 09:41:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://*thisisavailable.eu.pn/person2.html>
{'code': u'222222222', 'name': u'Kaspar K\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 936,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)

If run with "-o file.json", the file content is:

[{"code": "222222222", "name": "Ralph Pike"}]


So a little further along, but I am still at a loss as to how to make it work.

Can someone help me get this working?

Answer


Your problem is not related to having multiple items, even though it may be in the future.

Your problem is explained in the output:

2017-03-06 10:44:33 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'kidplay-wingsuit.c9users.io': <GET http://thisisavailable.eu.pn/scrapy/person2.html>

It means the spider is trying to go to a domain outside the allowed_domains list.

Your allowed domain is wrong. It should be:

allowed_domains = ["thisisavailable.eu.pn"] 
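For context, a minimal sketch of where this sits in the spider; the scheme is left off because OffsiteMiddleware matches each request's hostname against these entries, so a full URL there never matches:

class ExampleSpider(scrapy.Spider):
    name = "example"
    # domain only - no scheme or path
    allowed_domains = ["thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']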

Note:

Instead of using a different item for Person, just use it as a field in Company and assign a dict or list to it while scraping; see the sketch below.
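A minimal sketch of that single-item approach, assuming the page structure from the question; the helper next_person_or_done and the sequential chaining are illustrative choices rather than the only way to do it, and the proov package name is taken from the log above. Each person request carries the partially filled company item in meta, and the item is yielded only after the last board page has been scraped:

import scrapy
from proov.items import company_item


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        company = company_item()
        company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        company['board'] = []

        # Collect every person URL up front, then visit them one at a time,
        # carrying the partially filled item along in request.meta.
        person_urls = [response.urljoin(href) for href in
                       response.xpath("//table[@id='board']/tbody/tr/td[1]/a/@href").extract()]
        yield self.next_person_or_done(company, person_urls)

    def next_person_or_done(self, company, remaining):
        # Return a Request for the next person page, or the finished item.
        if not remaining:
            return company
        request = scrapy.Request(remaining[0], callback=self.parse_person)
        request.meta['company'] = company
        request.meta['remaining'] = remaining[1:]
        return request

    def parse_person(self, response):
        company = response.meta['company']
        company['board'].append({
            'name': response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first(),
            'code': response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first(),
        })
        yield self.next_person_or_done(company, response.meta['remaining'])

Chaining the person requests one after another side-steps returning out of the loop after the first member, and means only one callback at a time touches the shared board list.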


Thanks, Rafael - updated the question. – Esu