Scrapy CrawlSpider: setting up rules

This is my first attempt at a Scrapy CrawlSpider subclass. I created the following spider, based heavily on the docs example at https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Test_Spider(CrawlSpider):

    name = "test"

    allowed_domains = ['http://www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow='.jpg'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print(response.url)

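I am trying to get the spider to start in the prescribed directory and then extract all the '.jpg' links in that directory, but this is what I see: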

2016-09-29 13:07:35 [scrapy] INFO: Spider opened 
2016-09-29 13:07:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-09-29 13:07:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-09-29 13:07:36 [scrapy] DEBUG: Crawled (200) <GET http://www.dragonflieswellness.com/wp-content/uploads/2015/09/> (referer: None) 
2016-09-29 13:07:36 [scrapy] INFO: Closing spider (finished)

How can I get this working?

Source

2016-09-29 user61629

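First of all, the purpose of rules is not only to extract links but, more importantly, to follow them. If you only want to extract links (and, say, save them for later), you don't need to specify spider rules at all. If, on the other hand, you want to download the images, use a pipeline.

That said, the reason the spider does not follow the links is hidden in the implementation of LinkExtractor: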

# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',

    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg',
    'odp',

    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'zip', 'rar',
]
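
To make the effect of that list concrete, here is a minimal sketch in plain Python of the kind of extension check a link extractor performs. It is a simplified model and not Scrapy's actual code: the function names, the shortened extension list, and the example.com URLs are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Shortened, illustrative subset of Scrapy's IGNORED_EXTENSIONS.
IGNORED_EXTENSIONS = ['jpg', 'jpeg', 'png', 'gif', 'pdf', 'zip']


def has_ignored_extension(url, extensions):
    """Return True if the URL path ends in one of the given extensions."""
    # Only the path is inspected, so query strings do not hide the extension.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext.lstrip('.') in extensions


def filter_links(urls, deny_extensions=None):
    """Drop links whose extension is denied (hypothetical helper)."""
    if deny_extensions is None:
        # Mirror the idea of falling back to the default ignore list.
        deny_extensions = IGNORED_EXTENSIONS
    return [u for u in urls if not has_ignored_extension(u, deny_extensions)]


if __name__ == '__main__':
    links = ['http://example.com/a.jpg', 'http://example.com/page.html']
    print(filter_links(links))                      # default ignore list
    print(filter_links(links, deny_extensions=[]))  # nothing is ignored
```

In Scrapy itself, LinkExtractor accepts a deny_extensions argument that defaults to IGNORED_EXTENSIONS, so passing deny_extensions=[] should let a rule match the '.jpg' links. A pipeline is probably still the better fit if the goal is to download the images.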

Edit:

To download the images in this example using ImagesPipeline:

Add the following to your settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1} 

IMAGES_STORE = '/home/user/some_directory' # use a correct path

Create a new item:

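from scrapy import Item, Field


class MyImageItem(Item):
    images = Field()      # filled in by ImagesPipeline after download
    image_urls = Field()  # URLs that ImagesPipeline should download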

Modify your spider (add a parse method):

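from scrapy.loader import ItemLoader


def parse(self, response):
    loader = ItemLoader(item=MyImageItem(), response=response)
    # Select the href of every <a> whose last four characters are ".jpg"
    img_paths = response.xpath('//a[substring(@href, string-length(@href)-3)=".jpg"]/@href').extract()
    # Prefix each href with the start URL to build the full image URL
    loader.add_value('image_urls', [self.start_urls[0] + img_path for img_path in img_paths])
    return loader.load_item()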

The XPath searches for all hrefs that end with ".jpg", and the extract() method returns them as a list.
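
As a rough illustration of what that selection and URL construction amount to, here is a plain-Python sketch. The helper name jpg_urls and the sample file names are hypothetical. It uses urljoin from the standard library instead of string concatenation, on the assumption that some hrefs in a directory listing may be absolute paths.

```python
from urllib.parse import urljoin


def jpg_urls(base_url, hrefs):
    """Keep hrefs ending in '.jpg' and resolve them against base_url."""
    # str.endswith is case-sensitive, like the XPath substring comparison.
    return [urljoin(base_url, href) for href in hrefs if href.endswith('.jpg')]


if __name__ == '__main__':
    base = 'http://www.dragonflieswellness.com/wp-content/uploads/2015/09/'
    # Hypothetical hrefs as they might appear in a directory listing.
    hrefs = ['a.jpg', '/wp-content/uploads/2015/09/b.jpg', 'notes.txt']
    for url in jpg_urls(base, hrefs):
        print(url)
```

Inside a spider callback, response.urljoin(href) plays the same role as urljoin(base_url, href) here.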

The loader is an extra feature that simplifies creating items, but you can do without it.
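
A loader-free version might look like the sketch below. It assumes that a plain dict with an image_urls key is acceptable to ImagesPipeline in your Scrapy version, and the function name parse_images is hypothetical; in a real spider the body would live in the parse method.

```python
def parse_images(response):
    """Build the item as a plain dict instead of using ItemLoader."""
    # Same XPath as above: hrefs whose last four characters are ".jpg".
    hrefs = response.xpath(
        '//a[substring(@href, string-length(@href)-3)=".jpg"]/@href'
    ).extract()
    # response.urljoin resolves each href against the page URL.
    return {'image_urls': [response.urljoin(href) for href in hrefs]}
```

The only thing the pipeline needs from the item is the list of URLs under image_urls, which is why the loader is optional.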

Note that I am not an expert, and there may be a better, more elegant solution. However, this one works fine.

Source

2016-10-01 19:31:33 mihal277

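Thanks, this really helps, but I'm still trying to understand how it works. I want to download the jpg files in this case, so could I ask for an example that includes the pipeline functionality? – user61629

Take a look at my edited answer. – mihal277

Thank you for taking the time. I'm interested in seeing the different approaches people take. – user61629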