How do I feed a spider with links crawled within the spider?

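I am writing a spider (CrawlSpider) for an online store. Per the client's requirements, I need to write two rules: one to determine which pages contain items, and another to extract the items.

I already have both rules working independently:

If I set start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"] and comment out the first Rule and the parse_category code, parse_item extracts every item.
On the other hand, if I set start_urls = ["http://www.example.com/"] and comment out the second Rule and the parse_item code, parse_category returns every link that has items to extract, i.e. parse_category returns www.example.com/books.php and www.example.com/movies.php.

My problem is that I don't know how to combine the two, so that with start_urls = ["http://www.example.com/"] parse_category extracts www.example.com/books.php and www.example.com/movies.php and feeds those links to parse_item, where I actually extract the information for each item.

I need a way to do this instead of just using start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"], because if a new category is added in the future (e.g. www.example.com/music.php), the spider would not detect it automatically and would have to be edited by hand. Not a big deal, but the client doesn't want that.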

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# StoreCategory and StoreItem are the project's Item classes (import not shown).


class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]

    rules = (
        Rule(LinkExtractor(), follow=True, callback="parse_category"),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category or something else
        is_category = False  # placeholder for the result of that check
        name = None  # placeholder for the extracted category name
        if is_category:
            category["name"] = name
            category["url"] = response.url
        return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item

2015-11-02 yzT

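Instead of using parse_category, I used restrict_css in the LinkExtractor to get the links I want, and the second Rule seems to be applied to the pages reached through those links, so my question is answered. It ended up like this: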

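from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# StoreItem is the project's Item class (import not shown).


class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # No callback: links found inside #movies and #books are only followed.
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        # Callback given: each extracted link is passed to parse_item.
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item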

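It still cannot detect newly added categories (I found no clear pattern to use in restrict_css that would not also pick up other junk), but at least it meets the client's prerequisite: two rules, one for extracting the category links and another for extracting the item data.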

2015-11-02 10:43:58 yzT

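CrawlSpider rules don't work the way you want; you will need to implement the logic yourself. When you specify follow=True you can't use a callback, because the idea is to keep fetching links (no items) while following the rules; check the documentation.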

You could try something like this:

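from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor

# StoreCategory and StoreItem are the project's Item classes (import not shown).


class StoreSpider(Spider):  # plain Spider: no rules, and CrawlSpider reserves parse()
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):  # takes over the role of the rules
        category_le = LinkExtractor(allow="something for categories")  # placeholder pattern
        for link in category_le.extract_links(response):
            yield Request(link.url, callback=self.parse_category)
        item_le = LinkExtractor(allow="something for items")  # placeholder pattern
        for link in item_le.extract_links(response):
            yield Request(link.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category or something else
        is_category = False  # placeholder for the result of that check
        name = None  # placeholder for the extracted category name
        if is_category:
            category["name"] = name
            category["url"] = response.url
            yield category
        # feed the links found on this page back through parse()
        for req in self.parse(response):
            yield req

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item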

2015-11-02 01:54:57 eLRuLL

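The output of 'scrapy crawl storyder -o output.json -t json' is just the list of categories and some other links, but no items at all. IMO it never enters 'parse_item': checking the log, when it crawls an item link it returns the name and URL, which are the StoreCategory fields. – yzT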
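One possible explanation for this symptom, not confirmed in the thread: Scrapy's scheduler drops duplicate requests by default, so if the category extractor also matches item URLs, the request bound to parse_category is scheduled first and the later request for the same URL bound to parse_item is discarded. The sketch below is a toy model in plain Python (no Scrapy involved); the `schedule` helper, the item URL, and the regex patterns are all hypothetical.

```python
import re


def schedule(links, extractors):
    """Toy model of duplicate filtering: each URL is scheduled once, with the
    callback of the first extractor whose pattern matches it."""
    seen = set()
    scheduled = []
    for callback, pattern in extractors:
        for url in links:
            if url not in seen and re.search(pattern, url):
                seen.add(url)
                scheduled.append((url, callback))
    return scheduled


# Hypothetical URLs; the item URL format is not given in the thread.
LINKS = [
    "http://www.example.com/books.php",
    "http://www.example.com/item.php?id=1",
]

# Overlapping patterns: the category pattern also matches the item URL.
overlapping = schedule(
    LINKS,
    [("parse_category", r"\.php"), ("parse_item", r"item\.php")],
)

# Disjoint patterns: each URL matches exactly one extractor.
disjoint = schedule(
    LINKS,
    [
        ("parse_category", r"/(books|movies)\.php$"),
        ("parse_item", r"/item\.php\?id=\d+$"),
    ],
)
```

With the overlapping patterns the item URL ends up bound to parse_category, which would match the behaviour described in the comment; with disjoint patterns it reaches parse_item.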
