Using Scrapy with RabbitMQ

I am trying to run Scrapy from a RabbitMQ consumer, so that each message consumed from the queue starts a crawl.

Here is my code snippet:

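import json

import pika
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

# Assumed: a project module that defines RABBITMQ_HOST and RABBITMQ_TESTER_QUEUE.
import settings
# MySpider is the project's spider class; its import is not shown in the original.


def runTester(body):
    spider = MySpider(domain=body["url"], body=body)
    settings = get_project_settings()
    crawler = Crawler(settings)
    # Stop the reactor as soon as the spider closes.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    # Blocks until reactor.stop() is called; a stopped Twisted reactor cannot be restarted.
    reactor.run()


def callback(ch, method, properties, body):
    body = json.loads(body)
    runTester(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)


if __name__ == '__main__':
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=settings.RABBITMQ_HOST))
    channel = connection.channel()
    channel.queue_declare(queue=settings.RABBITMQ_TESTER_QUEUE, durable=True)
    # Deliver one unacknowledged message at a time to this consumer.
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(callback, queue=settings.RABBITMQ_TESTER_QUEUE)
    channel.start_consuming()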

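As you can see, the problem is that the reactor stops once the first message has been consumed and its spider has finished. What is the workaround for this?

I would like to keep the reactor running the whole time and start a new crawler whenever a message arrives from RabbitMQ.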


2013-12-15 neeagl

A better approach is to launch spiders through the Scrapy daemon API. When a request for a spider arrives, you call curl like this:

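import json
import logging
import subprocess

logger = logging.getLogger(__name__)


# Hypothetical wrapper name; the original shows only the function body.
def schedule_with_curl(flat_args):
    # flat_args: extra curl arguments, presumably e.g. ['-d', 'spider=somespider'].
    reply = {}
    args = ['curl',
            'http://localhost:6800/schedule.json',
            '-d', 'project=myproject'] + flat_args
    json_reply = subprocess.Popen(args, stdout=subprocess.PIPE).communicate()[0]
    try:
        reply = json.loads(json_reply)
        if reply['status'] != 'ok':
            logger.error('Error in spider: %r: %r.', args, reply)
        else:
            logger.debug('Started spider: %r: %r.', args, reply)
    except Exception:
        # Covers both malformed JSON and a reply without a 'status' key.
        logger.error('Error starting spider: %r: %r.', args, json_reply)
    return reply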

This starts a subprocess that effectively runs:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
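The same request can also be sent without spawning curl. Below is a minimal sketch using only the Python standard library, assuming the daemon listens on http://localhost:6800 as in the command above. The names build_schedule_request and schedule_spider are hypothetical helpers, not part of Scrapy or the daemon.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, taken from the curl command above.
SCHEDULE_URL = "http://localhost:6800/schedule.json"


def build_schedule_request(project, spider, endpoint=SCHEDULE_URL, **spider_args):
    """Build the POST request that `curl -d project=... -d spider=...` sends."""
    fields = [("project", project), ("spider", spider)]
    # Extra keyword arguments become additional form fields (sorted for stable output).
    fields.extend(sorted(spider_args.items()))
    data = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying `data` makes urllib issue a POST, like curl's -d option.
    return urllib.request.Request(endpoint, data=data)


def schedule_spider(project, spider, **spider_args):
    """Send the request and return the decoded JSON reply as a dict."""
    request = build_schedule_request(project, spider, **spider_args)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling schedule_spider("myproject", "somespider") should then be equivalent to the curl command, provided the daemon is running. Keeping the request construction separate from the network call makes it easy to inspect what would be sent.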

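The Scrapy daemon manages spider launches and has many other useful features, such as deploying a new version of a spider with a simple scrapy deploy command, and monitoring and balancing multiple spiders.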
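With the daemon in charge of running spiders, the RabbitMQ consumer from the question no longer has to touch the Twisted reactor: its callback can hand each message to the daemon and acknowledge it. A rough sketch, assuming the message body is JSON with a "url" key as in the question. Both make_callback and the schedule callable it receives are hypothetical names; schedule stands for whatever function submits the job to schedule.json and returns the daemon's reply as a dict.

```python
import json
import logging

logger = logging.getLogger(__name__)


def make_callback(schedule):
    """Return a pika-style callback that schedules one spider run per message.

    `schedule` is a hypothetical callable: it takes the decoded message
    (a dict) and returns the daemon's JSON reply (also a dict).
    """
    def callback(ch, method, properties, body):
        message = json.loads(body)
        reply = schedule(message)
        if reply.get("status") == "ok":
            logger.debug("Started spider for %r: %r", message.get("url"), reply)
        else:
            logger.error("Error in spider for %r: %r", message.get("url"), reply)
        # Acknowledge in both cases, as the callback in the question does.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    return callback
```

The returned function can be passed to channel.basic_consume in place of the callback from the question. Because it returns as soon as the job is queued, the consumer keeps receiving messages while the daemon runs the spiders.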


2013-12-16 01:09:56

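This works, but it does not run the Scrapy process immediately; it runs only after some time. Can you tell me how to run the spider immediately after scheduling it? – neeagl

No, my mistake. It works fine. :) Thanks for the suggestion. :) – neeagl

The question is about consuming from RabbitMQ, not about submitting jobs. I would also like to see a good example that uses RabbitMQ. – Chris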
