This is my first attempt at a Scrapy CrawlSpider subclass. I created the following spider, based closely on the documentation example at https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Test_Spider(CrawlSpider):
    name = "test"
    allowed_domains = ['http://www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

        # Extract links matching '.jpg' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow='.jpg'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print(response.url)
I am trying to get the spider to start at the prescribed directory and then extract all the '.jpg' links in that directory, but instead I see:
2016-09-29 13:07:35 [scrapy] INFO: Spider opened
2016-09-29 13:07:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-29 13:07:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-29 13:07:36 [scrapy] DEBUG: Crawled (200) <GET http://www.dragonflieswellness.com/wp-content/uploads/2015/09/> (referer: None)
2016-09-29 13:07:36 [scrapy] INFO: Closing spider (finished)
How can I get this to work?
Thanks, that definitely helps, but I'm still trying to understand how it works. In this case I'd like to download the jpg files, so could I ask for an example that includes the pipeline functionality? – user61629
Take a look at my edited answer. – mihal277
Thanks for taking the time. I'm interested in seeing the different approaches people take. – user61629