
New to Python, coming from PHP. I want to use Scrapy to scrape some websites, and I got through the tutorial and some simple scripts just fine. Now, writing the real thing, I hit this error: Scrapy passes a response, yet a required positional argument is missing:

Traceback (most recent call last):
  File "C:\Users\Naltroc\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Naltroc\Documents\Python Scripts\tutorial\tutorial\spiders\quotes_spider.py", line 52, in parse
    self.dispatcher[site](response)
TypeError: thesaurus() missing 1 required positional argument: 'response'

Scrapy instantiates the spider object automatically when the shell command scrapy crawl words is invoked.

As far as I know, self is implicitly passed as the first argument to any class method; you do not pass self in yourself when calling one.
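
For illustration, here is a minimal sketch of that implicit binding (my own example, not from the original post):

class Greeter:
    def hello(self, name):
        return 'hi ' + name

g = Greeter()
print(g.hello('bob'))           # bound call: Python fills in self -> 'hi bob'
print(Greeter.hello(g, 'bob'))  # equivalent explicit form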

First, this is what gets called:

# Scrapy automatically provides `response` to `parse()` when coming from `start_requests()`
def parse(self, response):
    site = response.meta['site']
    # same as "site = 'thesaurus'"
    self.dispatcher[site](response)
    # same as "self.dispatcher['thesaurus'](response)"

Then:

def thesaurus(self, response):
    filename = 'thesaurus.txt'
    words = ''
    ul = response.css('.relevancy-block ul')
    for idx, u in enumerate(ul):
        if idx == 1:
            break
        words = u.css('.text::text').extract()

    self.save_words(filename, words)

In PHP, this would be the same as calling $this->thesaurus($response). parse is clearly sending response as a variable, but Python says it is missing. Where did it go?

The full code is here:

import scrapy

class WordSpider(scrapy.Spider):
    def __init__(self, keyword = 'apprehensive'):
        self.k = keyword
    name = "words"

    # Utilities
    def make_csv(self, words):
        csv = ''
        for word in words:
            csv += word + ','
        return csv

    def save_words(self, words, fp):
        with ofpen(fp, 'w') as f:
            f.seek(0)
            f.truncate()
            csv = self.make_csv(words)
            f.write(csv)

    # site specific parsers
    def thesaurus(self, response):
        filename = 'thesaurus.txt'
        words = ''
        print("in func self is defined as ", self)
        ul = response.css('.relevancy-block ul')
        for idx, u in enumerate(ul):
            if idx == 1:
                break
            words = u.css('.text::text').extract()
            print("words is ", words)

        self.save_words(filename, words)

    def oxford(self):
        filename = 'oxford.txt'
        words = ''

    def collins(self):
        filename = 'collins.txt'
        words = ''

    # site/function mapping
    dispatcher = {
        'thesaurus': thesaurus,
        'oxford': oxford,
        'collins': collins,
    }

    def parse(self, response):
        site = response.meta['site']
        self.dispatcher[site](response)

    def start_requests(self):
        urls = {
            'thesaurus': 'http://www.thesaurus.com/browse/%s?s=t' % self.k,
            #'collins': 'https://www.collinsdictionary.com/dictionary/english-thesaurus/%s' % self.k,
            #'oxford': 'https://en.oxforddictionaries.com/thesaurus/%s' % self.k,
        }

        for site, url in urls.items():
            print(site, url)
            yield scrapy.Request(url, meta={'site': site}, callback=self.parse)

Answer


There are lots of tiny errors scattered around your code. I took the liberty of cleaning it up a bit, following common python/scrapy idioms :)

import logging
import scrapy


# Utilities
# should probably use csv module here or `scrapy crawl -o` flag instead
def make_csv(words):
    csv = ''
    for word in words:
        csv += word + ','
    return csv


def save_words(words, fp):
    with open(fp, 'w') as f:
        f.seek(0)
        f.truncate()
        csv = make_csv(words)
        f.write(csv)


class WordSpider(scrapy.Spider):
    name = "words"

    def __init__(self, keyword='apprehensive', **kwargs):
        super(WordSpider, self).__init__(**kwargs)
        self.k = keyword

    def start_requests(self):
        urls = {
            'thesaurus': 'http://www.thesaurus.com/browse/%s?s=t' % self.k,
            # 'collins': 'https://www.collinsdictionary.com/dictionary/english-thesaurus/%s' % self.k,
            # 'oxford': 'https://en.oxforddictionaries.com/thesaurus/%s' % self.k,
        }

        for site, url in urls.items():
            yield scrapy.Request(url, meta={'site': site}, callback=self.parse)

    def parse(self, response):
        parser = getattr(self, response.meta['site'])  # retrieve bound method by name
        logging.info(f'parsing using: {parser}')
        parser(response)
    # site specific parsers
    def thesaurus(self, response):
        filename = 'thesaurus.txt'
        words = []
        print("in func self is defined as ", self)
        ul = response.css('.relevancy-block ul')
        for idx, u in enumerate(ul):
            if idx == 1:
                break
            words = u.css('.text::text').extract()
            print("words is ", words)
        save_words(words, filename)  # note: (words, fp) order matches the signature

    def oxford(self, response):
        filename = 'oxford.txt'
        words = ''

    def collins(self, response):
        filename = 'collins.txt'
        words = ''
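
To spell out why the original dispatcher raised that TypeError (my sketch, not part of the original answer): a function stored in a class-level dict stays a plain function, it is never bound to the instance, so the single argument you pass is consumed as self and nothing is left for response:

class BrokenSpider:
    def thesaurus(self, response):
        print('parsing', response)

    dispatcher = {'thesaurus': thesaurus}  # plain function object, never bound

    def parse(self, response):
        # self.dispatcher['thesaurus'](response)
        #   -> TypeError: thesaurus() missing 1 required positional argument: 'response'
        #   because `response` lands in the `self` slot.
        self.dispatcher['thesaurus'](self, response)  # works: self passed explicitly
        getattr(self, 'thesaurus')(response)          # works: bound method, as in the cleaned-up code

BrokenSpider().parse('<html>')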

Thanks for the comments. 1. Is there a reason to add `**kwargs` to `__init__` if I know it will only ever be called with `keyword` as an argument? 2. It looks like the `parse` function acts as a controller: first it gets the right parser, then it passes the data along. That makes sense, but is it the only way to send the `response` data? 3. Why does using `getattr(self, response.meta['site'])` allow calling the appropriate method without prefixing it with `self.`? – Naltroc


Regarding #1: since you inherit from Spider, you want to pass kwargs through to the parent class. There is nothing worth passing here, but it is a pattern that makes this future-proof. 2. You misunderstand how scrapy works: by default, the spider starts a request chain for every url in `start_urls`, using the default callback `parse()`, where response is the response object for one of those start_urls. 3. You misunderstand what `self` is; `self` is a reference to the current class instance, so when you use `getattr` you don't need the prefix, because getattr already hands you a bound reference. – Granitosaurus
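
A quick way to see point 3 in action (my example, not the commenter's):

class Demo:
    def thesaurus(self, response):
        print('parsing', response)

d = Demo()
parser = getattr(d, 'thesaurus')  # bound method: the instance travels with it
print(parser.__self__ is d)       # True
parser('<html>')                  # same as d.thesaurus('<html>')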
