2017-02-13 87 views

I have written a Scrapy crawler, but I want to run it from a main method instead:

import sys, getopt 
import scrapy 
from scrapy.spiders import Spider 
from scrapy.http import Request 
import re 

class TutsplusItem(scrapy.Item): 
    title = scrapy.Field() 



class MySpider(Spider): 
    name = "tutsplus" 
    allowed_domains = ["bbc.com"] 
    start_urls = ["http://www.bbc.com/"] 

    def __init__(self, *args):
        try:
            opts, args = getopt.getopt(args, "hi:o:", ["ifile=", "ofile="])
        except getopt.GetoptError:
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit(2)

        super(MySpider, self).__init__(*args)



    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # We store already crawled links in this list
        crawledLinks = []

        # Pattern to check for a proper link
        # I only want to get the tutorial posts
        # linkPattern = re.compile("^\/tutorials\?page=\d+")

        for link in links:
            # If it is a proper link and not crawled yet, yield a new Request
            # if linkPattern.match(link) and link not in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item

Instead of starting the crawl with `scrapy runspider Crawler.py arg1 arg2`, I want a separate class with a main function and launch Scrapy from there. How can I do this?
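For reference, the `getopt.getopt` call in the spider's `__init__` parses short options (`-i`, `-o`) and their long equivalents (`--ifile`, `--ofile`) from an argument list. A minimal standalone sketch of that parsing, with illustrative argument values:

```python
import getopt

# Parse -i/-o (or --ifile/--ofile) from an argument list,
# using the same option string as the spider's __init__.
args = ["-i", "input.txt", "-o", "output.txt"]
opts, remaining = getopt.getopt(args, "hi:o:", ["ifile=", "ofile="])

inputfile = outputfile = None
for opt, val in opts:
    if opt in ("-i", "--ifile"):
        inputfile = val
    elif opt in ("-o", "--ofile"):
        outputfile = val

print(inputfile, outputfile)  # prints: input.txt output.txt
```

The trailing colons in `"hi:o:"` mark `-i` and `-o` as options that require a value, while `-h` takes none.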

Answer


There are different ways to approach this, but I suggest the following:

Create a main.py file in the same directory as the spider that opens a new process and launches the spider with the arguments you need.

The main.py file would contain something like the following:

import subprocess 
scrapy_command = 'scrapy runspider {spider_name} -a param_1="{param_1}"'.format(spider_name='your_spider', param_1='your_value') 

process = subprocess.Popen(scrapy_command, shell=True) 

With this in place, you only need to call your main file:

python main.py 
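Alternatively, Scrapy can be started in-process with its `CrawlerProcess` API instead of spawning a subprocess. A sketch, assuming your spider class is importable (the module name `my_spider` here is hypothetical):

```python
def run_spider(**spider_kwargs):
    """Start the spider in the current process and block until it finishes."""
    # Imports live inside the function so this file can be imported
    # even where Scrapy is not installed.
    from scrapy.crawler import CrawlerProcess
    from my_spider import MySpider  # hypothetical module containing the spider

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    # Keyword arguments are forwarded to the spider's __init__,
    # like -a param_1="your_value" on the command line.
    process.crawl(MySpider, **spider_kwargs)
    process.start()  # blocks until crawling is finished
```

You would then call `run_spider(param_1="your_value")` from your main function. Note that `CrawlerProcess.start()` runs the Twisted reactor, which can only be started once per process, so this suits a one-shot main function rather than repeated calls.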

Hope it helps.