What I want to do is scrape the company information (thisisavailable.eu.pn/company.html) and add to the company's board dictionary all board members, whose respective data comes from separate pages. In short: how do I crawl and scrape a set of data spread over multiple linked pages with Scrapy?
Ideally, the data I would get back from the sample pages would be:
{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "[email protected]",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}
I have searched Google and SO (e.g. here and here, the Scrapy docs, etc.) but have not been able to find a solution to a problem quite like this.
What I have been able to piece together so far:
items.py:
import scrapy

class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()

class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
spiders/example.py:
import scrapy
from proov.items import company_item, person_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)

    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []
        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print(person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/' + person_row.xpath("a/@href").extract_first(),
                                     callback=self.parse_person)
            request.meta['Person'] = Person
            # this returns on the first row, so the lines below never run
            return request
            board.append(Person)
        Company['board'] = board
        return Company

    def parse_person(self, response):
        print('PERSON!!!!!!!!!!!')
        print(response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person
UPDATE: Rafael spotted that the original problem was a wrong allowed_domains — I have commented it out for now, and when I run it I get (asterisks added to the URLs due to low rep):
$ scrapy crawl example
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) http://*thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) http://*thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) http://*thisisavailable.eu.pn/person2.html> (referer: http://*thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
2017-03-07 09:41:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://*thisisavailable.eu.pn/person2.html>
{'code': u'222222222', 'name': u'Kaspar K\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 936,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)
If run with "-o file.json", the file content is:
[{"code": "222222222", "name": "Ralph Pike"}]
So it got a little further, but I am still at a loss as to how to make it work.
Can someone help me get this working?
Thanks, Rafael — question updated. – Esu