Scrapy Nameko DependencyProvider not crawling the page

I am using Scrapy to build a sample web scraper as a Nameko dependency provider, but it does not crawl any pages. Here is the code:
import scrapy
from scrapy import crawler
from nameko import extensions
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    result = None

    def parse(self, response):
        TestSpider.result = {
            'heading': response.css('h1::text').extract_first()
        }


class ScrapyDependency(extensions.DependencyProvider):

    def get_dependency(self, worker_ctx):
        return self

    def crawl(self, spider=None):
        spider = TestSpider()
        spider.name = 'test_spider'
        spider.start_urls = ['http://www.example.com']
        self.runner = crawler.CrawlerRunner()
        self.runner.crawl(spider)
        d = self.runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return spider.result

    def run(self):
        if not reactor.running:
            reactor.run()
Here is the log:
Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Enabled item pipelines:
[]
Spider opened
Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Closing spider (finished)
Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 126088),
'log_count/INFO': 7,
'memusage/max': 59650048,
'memusage/startup': 59650048,
'start_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 97747)}
Spider closed (finished)
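The log shows that it did not crawl a single page, although one page was expected.
By contrast, if I create a regular CrawlerRunner and crawl the page, I get the expected result, {'heading': 'Example Domain'}. Here is that code: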
import scrapy
from scrapy import crawler
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.example.com']
    result = None

    def parse(self, response):
        TestSpider.result = {'heading': response.css('h1::text').extract_first()}


def crawl():
    runner = crawler.CrawlerRunner()
    runner.crawl(TestSpider)
    d = runner.join()
    # Stop the reactor once all crawls have finished, whether they succeeded or failed
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == '__main__':
    crawl()
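I have been struggling with this for two days and cannot figure out why the Scrapy crawler fails to crawl pages when it is used as a Nameko dependency provider. Please point out where I am going wrong.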
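What are you trying to get out of this? Setting the implementation aside for a moment, what is your actual requirement?
I want this to be a dependency of a Nameko service method. In other words, the Nameko microservice framework would call ScrapyDependency().crawl() to handle a request (a web scraping request) and return the result. The problem is that no pages are scraped when it is used this way.
You are mixing Nameko with a Twisted server; I am not sure how well the two gel together.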
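
One possible cause, offered as an assumption rather than something confirmed in this thread: CrawlerRunner.crawl() is documented as taking a spider class (or a Crawler object) and builds its own spider object from it. If so, attributes set on a hand-made instance, such as start_urls in ScrapyDependency.crawl(), may never reach the spider that actually runs, which would match the log above (spider opened, 0 pages crawled, finished). The toy model below contains no Scrapy code; MiniSpider and toy_crawl are hypothetical names used only to illustrate the mechanism.

```python
class MiniSpider:
    """Toy stand-in for a spider class (hypothetical, not scrapy.Spider)."""
    start_urls = []  # class-level default: nothing to crawl

    def __init__(self, **kwargs):
        # Keyword arguments become attributes of the new instance
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, **kwargs):
        # A classmethod receives the class even when called on an instance,
        # so this always builds a fresh object
        return cls(**kwargs)


def toy_crawl(spider_or_cls, **kwargs):
    """Toy runner (assumption): always creates its own spider, then
    returns the URLs that spider would request."""
    spider = spider_or_cls.from_crawler(**kwargs)
    return list(spider.start_urls)


# Mirrors the question: configure an instance by hand, then pass the instance
configured = MiniSpider()
configured.start_urls = ['http://www.example.com']
urls_from_instance = toy_crawl(configured)

# Alternative: pass the class and hand the URLs over as keyword arguments
urls_from_class = toy_crawl(MiniSpider, start_urls=['http://www.example.com'])

print(urls_from_instance)  # the hand-set start_urls are lost
print(urls_from_class)     # the URLs reach the spider that runs
```

If this is what is happening, passing the class together with keyword arguments, for example self.runner.crawl(TestSpider, start_urls=['http://www.example.com']), would be the first thing to try. It would not by itself settle the separate concern raised in the comments about running the Twisted reactor inside a Nameko worker.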