How to recursively crawl subpages with Scrapy

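So basically I'm trying to crawl a page that lists a set of categories, scrape the name of each category, follow the sublink associated with each category to a page that lists a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve the text data. In the end I want to output a JSON file formatted somewhat like this: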

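Category 1 name
- Subcategory 1 name
  - data from this subcategory's page
- Subcategory n name
  - data from this page

Category n name
- Subcategory 1 name
  - data from subcategory n's page

etc.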
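
For concreteness, the layout above could be expressed as nested JSON roughly as follows. This is only a minimal sketch: the key names (`category`, `subcategory`, `data`), the helper `nest_items`, and the sample records are assumptions made up for illustration and do not come from the actual site.

```python
import json


def nest_items(items):
    """Group flat records into {category: {subcategory: [data, ...]}}."""
    tree = {}
    for item in items:
        # Create the category and subcategory levels on first sight.
        subcategories = tree.setdefault(item["category"], {})
        subcategories.setdefault(item["subcategory"], []).append(item["data"])
    return tree


# Hypothetical flat records, one per scraped subcategory page.
flat_items = [
    {"category": "Category 1", "subcategory": "Subcategory 1", "data": "text A"},
    {"category": "Category 1", "subcategory": "Subcategory 2", "data": "text B"},
    {"category": "Category 2", "subcategory": "Subcategory 1", "data": "text C"},
]

nested = nest_items(flat_items)
print(json.dumps(nested, indent=2))
```

Keeping the scraped records flat and nesting them afterwards is one possible design; it keeps the spider simple, since each subcategory page can be emitted independently.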
  
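Ultimately, I want to be able to use this data with ElasticSearch.

I have barely any experience with Scrapy, and this is what I have so far (it only scrapes the category names from the first page; I don't know what to do from here). From my research, I believe I need to use a CrawlSpider, but I'm not sure what that entails. I've also been advised to use BeautifulSoup. Any help would be greatly appreciated.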
```
import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories']

    def parse(self, response):
        # Yield one item per category block found on the start page
        for i in response.css('div.CategoryTreeSection'):
            yield {
                'categories': i.css('a::text').extract_first()
            }
```

Source

2017-05-31 jetstream131

If you can, give us the address of the site – parik

I'm not familiar with ElasticSearch, but I would build a scraper like this:

```
import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories']

    def parse(self, response):
        for i in response.css('div.CategoryTreeSection'):
            # This is where you select the subcategory URL (the href string)
            subcategory = i.css('Put your selector here').extract_first()
            if subcategory is None:
                continue
            # Resolve relative links against the current page before requesting
            req = scrapy.Request(response.urljoin(subcategory), callback=self.parse_subcategory)
            req.meta['category'] = i.css('a::text').extract_first()
            yield req

    def parse_subcategory(self, response):
        yield {
            'category': response.meta.get('category'),
            # Select the name of the subcategory
            'subcategory': response.css('Put your selector here').extract_first(),
            # Select the data of the subcategory
            'subcategorydata': response.css('Put your selector here').extract(),
        }
```

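You collect the subcategory URLs and send a request for each one. The response to each request is handled in parse_subcategory. When sending the request, we attach the category name to it through the request's meta.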

In the parse_subcategory function, you read the category name back from meta and collect the subcategory data from the page.

Source

2017-06-01 09:01:23 Casper
