Can't crawl with Scrapy to a depth of more than 1

I can't configure Scrapy to run with depth > 1. I have tried the following 3 options; none of them worked, and request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings 
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the site, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How should it be configured to go deeper than 1?


2012-08-14 user555757

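The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit".

You wrote:

request_depth_max in the summary log is always 1

What you see in the logs is a statistic, not a setting. When it says that request_depth_max is 1, it means that no further requests were yielded from the first callback.

You have to show your spider code for anyone to understand what is going on.

But create a separate question for that.

UPDATE:

Ah, I see you are running the mininova spider from the Scrapy intro: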

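# Legacy import paths, matching the Scrapy 0.15 session shown in this thread.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# TorrentItem is the Item subclass defined in the Scrapy intro.
class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Raw string so that \d reaches the regex engine unchanged.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent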

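As you can see from the code, the spider never issues any requests for other pages; it scrapes all the data straight from the top-level pages. That is why the maximum depth is 1.

If you write your own spider that follows links to other pages, the maximum depth will be greater than 1.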


2012-08-15 04:56:44 warvariuc

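@warwaruk: "the spider never issues any requests for other pages", but MininovaSpider extends CrawlSpider, which uses 'rules' to recursively scrape more pages from each page, so there is often no need to issue requests manually. – 2012-08-15 17:46:55

I haven't used 'CrawlSpider', and I don't know whether it recurses. But in the specific example I quoted, the spider does not make deeper requests. – warvariuc 2012-08-15 17:51:32

By default it goes as deep as it can into each page, using the 'rules' to scrape every link it finds. The reason it stops at depth 1 is that there are actually no further links to scrape beyond the 'today' page (other than each page's link to itself, which is not re-requested by default). – 2012-08-15 17:59:49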

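warwaruk is correct: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit".

So let's scrape mininova and see what happens. Starting at the today page, we see that there are two tor links: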

$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot) 
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response) 
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's scrape the first link. We see that there are no new tor links on that page, just a link to itself, which does not get re-crawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):

>>> fetch('http://www.mininova.org/tor/13204738') 
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None) 
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response) 
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]

No luck there; we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737') 
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None) 
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

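Nope, this page also contains only one link, a link to itself, which gets filtered as well. So there are actually no links left to scrape, and Scrapy closes the spider (at depth == 1).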
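The reasoning in this answer can be sketched as a toy breadth-first crawl over the link graph seen in the shell session. This is only an illustration under stated assumptions: `simulate_crawl` and `LINKS` are hypothetical names, the code does not use Scrapy, and it does not model the order in which real Scrapy applies depth tracking and duplicate filtering. It only counts the depth of requests that survive both the depth limit and the duplicate check.

```python
from collections import deque

# Link graph taken from the shell session above: the 'today' page links to
# two torrent pages, and each torrent page links only to itself.
LINKS = {
    "http://www.mininova.org/today": [
        "http://www.mininova.org/tor/13204738",
        "http://www.mininova.org/tor/13204737",
    ],
    "http://www.mininova.org/tor/13204738": ["http://www.mininova.org/tor/13204738"],
    "http://www.mininova.org/tor/13204737": ["http://www.mininova.org/tor/13204737"],
}


def simulate_crawl(links, start_url, depth_limit=0):
    """Toy crawl model (hypothetical helper, not Scrapy code).

    Returns (max depth reached, list of fetched URLs).
    depth_limit=0 means "no limit", mirroring the DEPTH_LIMIT default.
    """
    seen = {start_url}               # stands in for the duplicate filter
    queue = deque([(start_url, 0)])  # the start URL is at depth 0
    max_depth = 0
    fetched = []
    while queue:
        url, depth = queue.popleft()
        fetched.append(url)
        for link in links.get(url, []):
            next_depth = depth + 1
            if depth_limit and next_depth > depth_limit:
                continue  # dropped by the depth limit
            if link in seen:
                continue  # dropped as a duplicate (like dont_filter=False)
            seen.add(link)
            max_depth = max(max_depth, next_depth)
            queue.append((link, next_depth))
    return max_depth, fetched


if __name__ == "__main__":
    print(simulate_crawl(LINKS, "http://www.mininova.org/today"))
    print(simulate_crawl(LINKS, "http://www.mininova.org/today", depth_limit=2))
```

In this model, raising `depth_limit` to 2 reports the same maximum depth as the unlimited run, because the limit only caps the depth and cannot create links that are not there. The maximum only grows if a torrent page links to a URL that has not been seen yet.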


2012-08-15 18:13:08

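I had a similar problem; what helped was setting follow=True when defining the Rule:

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.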
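The default described in that quote can be written out as a small function. This is a sketch of the documented rule only; `resolve_follow` is a hypothetical helper for illustration and is not part of Scrapy.

```python
def resolve_follow(callback=None, follow=None):
    """Return the effective follow flag according to the quoted rule.

    Hypothetical helper, not Scrapy code: when follow is not given,
    it is True only if no callback was supplied.
    """
    if follow is None:
        return callback is None
    return bool(follow)


if __name__ == "__main__":
    # The mininova rule passes a callback and no follow argument,
    # so links found on the torrent pages would not be followed.
    print(resolve_follow(callback="parse_torrent"))
    # A rule without a callback follows links by default.
    print(resolve_follow())
    # Passing follow=True explicitly overrides the default, as in
    # Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent', follow=True)
    print(resolve_follow(callback="parse_torrent", follow=True))
```

Applied to the spider quoted earlier in the thread, the rule has a callback ('parse_torrent') and no explicit follow, so by this default its links are not followed further unless follow=True is passed.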


2013-05-10 12:19:42
