
I am using a sitemap spider in Scrapy (Python). The sitemap seems to be in an unusual format, with '//' in front of the URLs:

<url> 
    <loc>//www.example.com/10/20-baby-names</loc> 
</url> 
<url> 
    <loc>//www.example.com/elizabeth/christmas</loc> 
</url> 

myspider.py

from scrapy.contrib.spiders import SitemapSpider 
from myspider.items import * 

class MySpider(SitemapSpider): 
    name = "myspider" 
    sitemap_urls = ["http://www.example.com/robots.txt"] 

    def parse(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        return item

I am getting this error:

raise ValueError('Missing scheme in request url: %s' % self._url) 
    exceptions.ValueError: Missing scheme in request url: //www.example.com/10/20-baby-names 

How can I manually parse the URLs when using the sitemap spider?
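The root cause is that the `<loc>` entries are scheme-relative ("protocol-relative") URLs, which Scrapy's `Request` refuses. The standard library can resolve such a URL against the page it came from; a minimal sketch using Python 3's `urllib.parse` (the question's code uses the equivalent Python 2 `urlparse`):

```python
from urllib.parse import urljoin

# The sitemap was fetched over http://, so resolving a scheme-relative
# location against the sitemap's own URL borrows that scheme.
sitemap_url = "http://www.example.com/sitemap.xml"
loc = "//www.example.com/10/20-baby-names"

absolute = urljoin(sitemap_url, loc)
print(absolute)  # http://www.example.com/10/20-baby-names
```

With a scheme attached, `Request(absolute)` no longer raises the `Missing scheme` error.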

Answers


I think the best and cleanest solution is to add a downloader middleware that rewrites the malformed URLs without the spider noticing.

import re 
import urlparse 
from scrapy.http import XmlResponse 
from scrapy.utils.gz import gunzip, is_gzipped 
from scrapy.contrib.spiders import SitemapSpider 

# downloader middleware
class SitemapWithoutSchemeMiddleware(object):
    def process_response(self, request, response, spider):
        if isinstance(spider, SitemapSpider):
            body = self._get_sitemap_body(response)

            if body:
                scheme = urlparse.urlsplit(response.url).scheme
                body = re.sub(r'<loc>\/\/(.+)<\/loc>', r'<loc>%s://\1</loc>' % scheme, body)
                return response.replace(body=body)

        return response

    # this is copied from scrapy's Sitemap class, but that class is
    # only for internal use and its API can change without notice
    def _get_sitemap_body(self, response):
        """Return the sitemap body contained in the given response,
        or None if the response is not a sitemap.
        """
        if isinstance(response, XmlResponse):
            return response.body
        elif is_gzipped(response):
            return gunzip(response.body)
        elif response.url.endswith('.xml'):
            return response.body
        elif response.url.endswith('.xml.gz'):
            return gunzip(response.body)
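The heart of the middleware is the `re.sub` call; applied in isolation to a sitemap fragment it behaves like this (a standalone sketch of the same pattern, using Python 3's `urllib.parse` in place of Python 2's `urlparse`):

```python
import re
from urllib.parse import urlsplit  # urlparse.urlsplit in Python 2

body = "<url><loc>//www.example.com/10/20-baby-names</loc></url>"
response_url = "http://www.example.com/sitemap.xml"

# Borrow the scheme of the page the sitemap was fetched from,
# then splice it into every scheme-relative <loc>.
scheme = urlsplit(response_url).scheme
fixed = re.sub(r'<loc>\/\/(.+)<\/loc>', r'<loc>%s://\1</loc>' % scheme, body)
print(fixed)  # <url><loc>http://www.example.com/10/20-baby-names</loc></url>
```

Note that the greedy `.+` could over-match if several `<loc>` elements share one line; a non-greedy `.+?` would be safer in that case.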

If I see it correctly, you can (as a quick solution) override the default implementation of _parse_sitemap in SitemapSpider. This is not nice, because you will have to copy a lot of code, but it should work. You have to add a method that produces a URL with a scheme.

"""if the URL starts with // take the current website scheme and make an absolute 
URL with the same scheme""" 
def _fix_url_bug(url, current_url): 
    if url.startswith('//'): 
      ':'.join((urlparse.urlsplit(current_url).scheme, url)) 
     else: 
      yield url 

def _parse_sitemap(self, response):
    if response.url.endswith('/robots.txt'):
        for url in sitemap_urls_from_robots(response.body):
            yield Request(url, callback=self._parse_sitemap)
    else:
        body = self._get_sitemap_body(response)
        if body is None:
            log.msg(format="Ignoring invalid sitemap: %(response)s",
                    level=log.WARNING, spider=self, response=response)
            return

        s = Sitemap(body)
        if s.type == 'sitemapindex':
            for loc in iterloc(s):
                # fix the URL before the follow-test, so the test sees a URL
                # that includes the scheme (not sure if this is the best solution)
                loc = _fix_url_bug(loc, response.url)
                if any(x.search(loc) for x in self._follow):
                    yield Request(loc, callback=self._parse_sitemap)
        elif s.type == 'urlset':
            for loc in iterloc(s):
                loc = _fix_url_bug(loc, response.url)  # same here
                for r, c in self._cbs:
                    if r.search(loc):
                        yield Request(loc, callback=c)
                        break

This is just a general idea and is untested, so it might not work at all or contain syntax errors. Please reply via the comments so I can improve my answer.

The sitemap you are trying to parse also seems to be wrong: per the RFC, a missing scheme is perfectly fine, but the sitemaps protocol requires URLs to begin with a scheme.
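A quick way to check a feed against that requirement before blaming the spider is to test each `<loc>` value for a scheme (a small sketch, not part of any answer's code; Python 3's `urllib.parse`):

```python
from urllib.parse import urlsplit

locs = [
    "//www.example.com/10/20-baby-names",          # scheme-relative: invalid in a sitemap
    "http://www.example.com/elizabeth/christmas",  # fully qualified: valid
]

# A <loc> is acceptable only when urlsplit finds a scheme on it.
valid = [loc for loc in locs if urlsplit(loc).scheme]
print(valid)  # only the http:// URL survives
```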


Doesn't work: 'iterloc' and 'SitemapSpider' live in the same module, and redefining it like this doesn't change anything. I'm looking into this; a "mock" might be helpful. – alecxe 2014-12-04 15:30:54


Could you elaborate on that? If he has a 'MySpider' instance running, shouldn't the 'iterloc' method of 'MySpider' be called? I have a feeling you are referring to a polymorphism problem in my idea? – Aufziehvogel 2014-12-04 15:35:13


The problem is that 'iterloc' is not a method of 'SitemapSpider'; it is a separate function outside the class, but in the same module. – alecxe 2014-12-04 15:49:52


I used @alecxe's trick and parsed the URLs inside the spider instead. I got it working, but I'm not sure whether it is the best way.

import re
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.utils.response import body_or_str
from example.items import *

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/sitemap.xml"]

    def parse(self, response):
        nodename = 'loc'
        text = body_or_str(response)
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(text):
            url = match.group(2)
            if url.startswith('//'):
                url = 'http:' + url
            # yield every matched URL, not only the scheme-relative ones
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        return item
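The regex in `parse` above can be exercised on its own; given a small sitemap string it pulls each `<loc>` body out as group 2 (a standalone sketch of the same pattern and prefix fix):

```python
import re

nodename = 'loc'
text = """<urlset>
<url><loc>//www.example.com/10/20-baby-names</loc></url>
<url><loc>//www.example.com/elizabeth/christmas</loc></url>
</urlset>"""

# Group 1 eats "<loc" plus the closing ">" (or whitespace); group 2 is the URL.
r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
urls = ['http:' + m.group(2) if m.group(2).startswith('//') else m.group(2)
        for m in r.finditer(text)]
print(urls)
# ['http://www.example.com/10/20-baby-names',
#  'http://www.example.com/elizabeth/christmas']
```

Hard-coding `http:` works here but silently downgrades any `https` site; borrowing the scheme from `response.url`, as in the middleware answer, is the more general fix.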