Problem following links with Scrapy

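I'm trying to get my web crawler to crawl the links it extracts from a web page. I'm using Scrapy. I can successfully scrape data with my spider, but I can't get it to crawl. I believe the problem is in my rules section. I'm new to Scrapy. Thanks in advance for your help.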

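I'm scraping this site, trying to follow links such as:

/wiki/index.php/A._Ghani

or

/wiki/index.php/A._Keith_Carreiro

which are listed here:

http://ballotpedia.org/wiki/index.php/Category:2012_challenger

That is how the links I'm trying to follow look in the page source. Here is my spider's code: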

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from ballot1.items import Ballot1Item


class Ballot1Spider(CrawlSpider):
    name = "stewie"
    allowed_domains = ["ballotpedia.org"]
    start_urls = [
        "http://ballotpedia.org/wiki/index.php/Category:2012_challenger"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'w+'), follow=True),
        Rule(SgmlLinkExtractor(allow=r'\w{4}/\w+/\w+'), callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('*')
        items = []
        for site in sites:
            item = Ballot1Item()
            item['candidate'] = site.select('/html/head/title/text()').extract()
            item['position'] = site.select('//table[@class="infobox"]/tr/td/b/text()').extract()
            item['controversies'] = site.select('//h3/span[@id="Controversies"]/text()').extract()
            item['endorsements'] = site.select('//h3/span[@id="Endorsements"]/text()').extract()
            item['currentposition'] = site.select('//table[@class="infobox"]/tr/td[@style="text-align:center; background-color:red;color:white; font-size:100%; font-weight:bold;"]/text()').extract()
            items.append(item)
        return items

Source

2013-02-12 Young Grasshopper

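r'w+' is wrong (I think you meant r'\w+'), and r'\w{4}/\w+/\w+' doesn't look right either, because it doesn't match your links (it is missing a leading /). Why not try r'/wiki/index.php/.+'? Don't forget that \w does not include . and other symbols that can be part of an article name.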
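The patterns discussed above can be checked against the two example links with Python's re module alone. This is a minimal sketch, assuming the link extractor applies each allow pattern with search semantics (a match anywhere in the URL); the helper name matching_links is made up for illustration. Under that assumption r'w+' is too broad rather than too narrow, since it matches any URL containing a "w".

```python
import re

# Example link paths taken from the question.
links = [
    "/wiki/index.php/A._Ghani",
    "/wiki/index.php/A._Keith_Carreiro",
]

# The two patterns from the question's rules, plus the suggested replacement.
patterns = [r"w+", r"\w{4}/\w+/\w+", r"/wiki/index.php/.+"]


def matching_links(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere (assumed search semantics)."""
    return [link for link in candidates if re.search(pattern, link)]


results = {pattern: matching_links(pattern, links) for pattern in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r}: matched {len(matched)} of {len(links)} links")

# \w covers letters, digits and underscore only, so a name containing "." is not all \w.
name_is_all_word_chars = re.fullmatch(r"\w+", "A._Ghani") is not None
```

The second pattern fails on these links because "index.php" contains a ".", which \w+ cannot cross before the next "/".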

Source

2013-02-12 00:49:36 wRAR

Hey, thanks a lot. I'll try it now. – 2013-02-12 00:52:19

I just tried the above changes to the rules. It still only scrapes my start URL. – 2013-02-12 00:55:20

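The links you're after are only present in this element:

<div lang="en" dir="ltr" class="mw-content-ltr">

So you have to restrict the XPath to keep extraneous links out:

restrict_xpaths='//div[@id="mw-pages"]/div'

Lastly, you only want to follow links that look like /wiki/index.php?title=Category:2012_challenger&pagefrom=Alison+McCoy#mw-pages, so your final rules should look like this: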

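rules = (
    # Follow the category's pagination links.
    Rule(
        SgmlLinkExtractor(
            allow=r'&pagefrom='
        ),
        follow=True
    ),
    # Extract candidate links from the listing only; callback is an
    # argument of Rule, not of the link extractor.
    Rule(
        SgmlLinkExtractor(
            restrict_xpaths='//div[@id="mw-pages"]/div'
        ),
        callback='parse'
    )
)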
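To illustrate what restricting extraction to the mw-pages element buys you, here is a rough standard-library sketch that collects only the links nested inside that div. It is an approximation of the idea, not how Scrapy evaluates restrict_xpaths: the class name ScopedLinkCollector and the sample markup are hypothetical, and it accepts links at any depth inside the div rather than only under its child div.

```python
from html.parser import HTMLParser


class ScopedLinkCollector(HTMLParser):
    """Collect href values of <a> tags nested inside <div id="mw-pages">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the target div; 0 means outside it
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the target div, then track every nested div.
            if self.depth > 0 or attrs.get("id") == "mw-pages":
                self.depth += 1
        elif tag == "a" and self.depth > 0 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Hypothetical, heavily simplified page markup for illustration only.
SAMPLE_HTML = """
<div id="sidebar"><a href="/wiki/index.php/Main_Page">Main page</a></div>
<div id="mw-pages">
  <div lang="en" dir="ltr" class="mw-content-ltr">
    <a href="/wiki/index.php/A._Ghani">A. Ghani</a>
    <a href="/wiki/index.php/A._Keith_Carreiro">A. Keith Carreiro</a>
  </div>
</div>
<div id="footer"><a href="/wiki/index.php/About">About</a></div>
"""

collector = ScopedLinkCollector()
collector.feed(SAMPLE_HTML)
scoped_links = collector.links
print(scoped_links)
```

The sidebar and footer links are skipped, which is the effect the restriction is meant to have on navigation and other site-wide links.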

Source

2013-02-12 00:55:05 Blender

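Thanks, Blender. I'll try it right away. – 2013-02-12 00:56:53

@YoungGrasshopper: See my edit. The "allow" rule was incorrect. – Blender 2013-02-12 00:58:42

Adding this code raised an invalid expression error. It also says bad character range. Does that sound familiar, @Blender? – 2013-02-12 01:06:42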

You are using CrawlSpider with a callback named parse, which the Scrapy documentation expressly warns will prevent crawling.

Rename it to something like parse_items and you should be fine.
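Why the callback name matters can be shown with a toy model. The classes below are a deliberately simplified stand-in and not Scrapy's actual implementation: they only assume that the base class uses its own parse method to apply the rules, so a subclass that defines parse replaces the rule-following logic. All names here (ToyCrawlSpider, is_candidate, and so on) are hypothetical.

```python
class ToyCrawlSpider:
    """Toy stand-in for a crawl spider; NOT Scrapy's actual implementation."""

    rules = ()  # pairs of (link predicate, callback method name)

    def parse(self, page):
        # In this toy model, parse() is where the rules get applied to a page's links.
        results = []
        for link in page["links"]:
            for matches, callback_name in self.rules:
                if matches(link):
                    results.append(getattr(self, callback_name)(link))
        return results


def is_candidate(link):
    """Accept links that look like the candidate pages from the question."""
    return link.startswith("/wiki/index.php/")


class BrokenSpider(ToyCrawlSpider):
    rules = ((is_candidate, "parse"),)

    def parse(self, page):  # overrides the rule-applying method above
        return ["scraped start page only"]


class FixedSpider(ToyCrawlSpider):
    rules = ((is_candidate, "parse_items"),)

    def parse_items(self, link):  # distinct name, so the base parse() stays intact
        return f"scraped {link}"


start_page = {"links": ["/wiki/index.php/A._Ghani", "/wiki/index.php/A._Keith_Carreiro"]}
broken_output = BrokenSpider().parse(start_page)
fixed_output = FixedSpider().parse(start_page)
print(broken_output)
print(fixed_output)
```

In this model the broken spider never looks at the links at all, which mirrors the symptom reported in the comments: only the start URL gets scraped.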

Source

2013-02-12 12:00:11 Talvalin
