初始問題

我寫一個CrawlSpider類（使用scrapy庫）的方法創建單元測試和依靠大量的scrapy異步魔法，使其工作。這是，剝離下來：爲scrapy CrawlSpider

class MySpider(CrawlSpider): 
    rules = [Rule(LinkExtractor(allow='myregex'), callback='parse_page')] 
    # some other class attributes 

    def __init__(self, *args, **kwargs): 
     super(MySpider, self).__init__(*args, **kwargs) 
     self.response = None 
     self.loader = None 

    def parse_page_section(self): 
     soup = BeautifulSoup(self.response.body, 'lxml') 
     # Complicated scraping logic using BeautifulSoup 
     self.loader.add_value(mykey, myvalue) 

    # more methods parsing other sections of the page 
    # also using self.response and self.loader 

    def parse_page(self, response): 
     self.response = response 
     self.loader = ItemLoader(item=Item(), response=response) 
     self.parse_page_section() 
     # call other methods to collect more stuff 
     self.loader.load_item()

class屬性rule告訴我蜘蛛遵循一定的聯繫，並跳轉到一個回調函數一旦網絡頁面下載。我的目標是測試稱爲parse_page_section的解析方法，無需運行爬蟲，甚至無需發出真正的HTTP請求。

我試過

出於本能，我轉身向mock庫。我明白你是如何模擬一個函數來測試它是否已被調用（哪些參數以及是否有任何副作用......），但這不是我想要的。我想實例化一個假對象MySpider並分配足夠的屬性以便能夠調用parse_page_section方法。

在上述例子中，我需要一個response對象來實例化我ItemLoader和具體爲self.response.body屬性來實例我BeautifulSoup。原則上，我可以做虛假對象是這樣的：

from argparse import Namespace 

my_spider = MySpider(CrawlSpider) 
my_spider.response = NameSpace(body='<html>...</html>')

行之有效，爲BeautifulSoup類，但我需要增加更多的屬性創建ItemLoader對象。對於更復雜的情況，它會變得醜陋難以控制。

我的問題

這是完全正確的方法嗎？我在網上找不到類似的例子，所以我認爲我的方法在更基礎的層面上可能是錯誤的。任何有識之士將不勝感激。

來源

2016-04-28 cyberbikepunk

@ChrisP感謝您的編輯。我並沒有把scrapy標籤放在首位，因爲我認爲這個問題一般與單元測試有關。 – cyberbikepunk

這絕對是單元測試，但是大量進行刮擦的人可能會對單元測試刮板有一些獨特的見解。 – ChrisP

在這個特殊的'CrawlSpider'的情況下，我可以擺脫僞造響應對象。手工操作很困難，但這有幫助嗎？ http://requests-mock.readthedocs.io/en/latest/overview.html。這會是一個好方法嗎？ – cyberbikepunk

你見過Spiders Contracts？

這允許你測試你的蜘蛛的每個回調，而不需要很多代碼。例如：

def parse(self, response): 
    """ This function parses a sample response. Some contracts are mingled 
    with this docstring. 

    @url http://www.amazon.com/s?field-keywords=selfish+gene 
    @returns items 1 16 
    @returns requests 0 0 
    @scrapes Title Author Year Price 
    """

使用check命令運行合同檢查。

看看這個answer，如果你想要更大的東西。

來源

2016-04-28 15:27:38

我認爲這是有意義的，因爲網站本身可以改變，所以用*真實生活*（集成）測試代替單元測試。從本質上講，你的單元測試工作並不能保證你的拼寫工作。感謝您的建議。 – cyberbikepunk

雖然在單元測試中仍然有價值，但至少在編碼時會進行理智檢查。你提供的另一個答案（http://stackoverflow.com/questions/6456304/scrapy-unit-testing/12741030#12741030]顯示瞭如何通過實際使用'scrapy'' Request'來更好地僞造一個響應對象，並且'響應'對象。好的提示。 – cyberbikepunk

爲scrapy CrawlSpider

初始問題

我試過

我的問題

回答

相關問題