2017-06-21

I am new to scrapy and am trying to configure PyCharm to work with scrapy. I get an error when debugging the program. I also tried adding my Scrapy project to PyCharm as follows: pycharm scrapy configuration (screenshot)

File -> Settings -> Project Structure -> Add Content Root. It does not work.

import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.spiders import Spider
from scrapy.http import Request, XmlResponse
from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots
from scrapy.utils.gz import gunzip, is_gzipped
import re
import requests


class GetpagesfromsitemapSpider(SitemapSpider):
    name = "test"
    handle_httpstatus_list = [404]

    def parse(self, response):
        print(response.url)

    def _parse_sitemap(self, response):
        if response.url.endswith('/robots.txt'):
            for url in sitemap_urls_from_robots(response.body):
                yield Request(url, callback=self._parse_sitemap)
        else:
            body = self._get_sitemap_body(response)
            if body is None:
                self.logger.info('Ignoring invalid sitemap: %s', response.url)
                return

            s = Sitemap(body)
            sites = []
            if s.type == 'sitemapindex':
                for loc in iterloc(s, self.sitemap_alternate_links):
                    if any(x.search(loc) for x in self._follow):
                        yield Request(loc, callback=self._parse_sitemap)
            elif s.type == 'urlset':
                for loc in iterloc(s):
                    for r, c in self._cbs:
                        if r.search(loc):
                            sites.append(loc)
                            break
            print(sites)

    def __init__(self, spider=None, *a, **kw):
        super(GetpagesfromsitemapSpider, self).__init__(*a, **kw)
        self.spider = spider
        l = []
        url = "https://channelstore.roku.com"
        resp = requests.head(url + "/sitemap.xml")
        if resp.status_code != 404:
            l.append(resp.url)
        else:
            resp = requests.head(url + "/robots.txt")
            if resp.status_code == 200:
                l.append(resp.url)
        self.sitemap_urls = l
        print(self.sitemap_urls)

def iterloc(it, alt=False):
    for d in it:
        yield d['loc']

        # Also consider alternate URLs (xhtml:link rel="alternate")
        if alt and 'alternate' in d:
            for l in d['alternate']:
                yield l

Error: error report (screenshot)

Configuration: PyCharm configuration settings (screenshot). Spider file location: location (screenshot)

Answer


Configuration > Script parameters > crawl [spider name]

In your case, replace [spider name] with test: crawl test
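
An alternative, if editing the run configuration keeps failing, is to debug through a small runner script. This is a minimal sketch, assuming the spider is named test (as in the code above), scrapy is installed in the interpreter PyCharm uses, and the hypothetical file run_test.py is placed next to scrapy.cfg:

# run_test.py -- hypothetical helper; set it as the "Script path" of an
# ordinary PyCharm Python run/debug configuration.
from scrapy.cmdline import execute

# Equivalent to typing "scrapy crawl test" in a terminal; breakpoints set
# in the spider code are then hit by the PyCharm debugger.
execute(['scrapy', 'crawl', 'test'])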

UPDATE

If you are not inside a scrapy project and just want to run a single file, you can use runspider [/file/path]

For your spider, in your case: runspider items.py
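
The same runner-script idea works for a standalone spider file. A sketch, assuming the spider lives in items.py in the working directory:

# run_single.py -- hypothetical helper for debugging a standalone spider file.
from scrapy.cmdline import execute

# Equivalent to "scrapy runspider items.py"; adjust the path if the spider
# file lives elsewhere.
execute(['scrapy', 'runspider', 'items.py'])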


Doesn't work – Vipin


What is the error? –


Same error as before – Vipin