
I am new to Scrapy. A Scrapy request callback is not firing.

I want to scrape pages in a loop: A -> B -> C -> A -> B -> C -> ...

However, the request returned from the item_scraped callback never fires.

I do not understand why the callback is not triggered.

Below is my spider code.

import scrapy 
from scrapy import signals 
import time 
import settings 

from scrapy.loader.processors import MapCompose 
from scrapy.loader import ItemLoader 
from items import StudentID, StudentInfo 

class GetidSpider(scrapy.Spider): 
    name = "getid" 

    custom_settings = { 
        'ITEM_PIPELINES': { 
            'pipelines.GetidPipeline': 300 
        } 
    } 

    @classmethod 
    def from_crawler(cls, crawler, *args, **kwargs): 
        # connect signal handlers for scraped items and spider shutdown 
        spider = super(GetidSpider, cls).from_crawler(crawler, *args, **kwargs) 
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped) 
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed) 
        return spider 

    def __init__(self, login_id=None, login_pwd=None, Center=None): 
        self.login_id = login_id 
        self.login_pwd = login_pwd 
        self.CENTER = Center 

    def start_requests(self): 
        yield scrapy.Request("https://sdszone1.e-wsi.com/index.jhtml", self.login) 

    def login(self, response): 
        return scrapy.FormRequest.from_response( 
            response, 
            formname='Logon', 
            formdata={ 
                'login': self.login_id, 
                'password': self.login_pwd 
            }, 
            callback=self.get_student_id 
        ) 

    def get_student_id(self, response): 
        for title in response.xpath('//title/text()').extract(): 
            if title == "SDS : Main": 
                self.student_info_count = 3 
                return scrapy.Request('http://sdszone1.e-wsi.com/standard/followup/studyrecord/studentstudyrecord.jhtml', 
                                      callback=self.print_student_info) 

    def print_student_info(self, response): 
        print self.student_info_count 
        if self.student_info_count > 0: 
            print "in if" 
            yield scrapy.Request('http://sdszone1.e-wsi.com/standard/followup/studyrecord/contracts.jhtml?studentCode=18138', 
                                 callback=self.save_student_info) 
        else: 
            print "in else" 
            yield scrapy.Request('http://sdszone1.e-wsi.com/standard/index.jhtml') 

    def save_student_info(self, response): 
        print "in save_student_info" 
        print response.xpath('//input[@type="hidden"][@name="profileId"]/@value').extract() 
        if response.xpath('//input[@type="hidden"][@name="profileId"]/@value').extract() == "": 
            yield scrapy.Request('http://sdszone1.e-wsi.com/standard/index.jhtml') 
        else: 
            student_info = ItemLoader(item=StudentInfo(), response=response) 
            student_info.add_value('item_name', 'student_info') 
            student_info.add_xpath('SDS_No', '//table/tr/td[@width="100%"][@class="text"]/text()', MapCompose(unicode.strip, unicode.title)) 
            student_info.add_xpath('StartLevel', '//table/tbody/tr/td[@class="text"][3]/text()', MapCompose(unicode.strip, unicode.title)) 
            student_info.add_xpath('EndLevel', '//table/tbody/tr/td[@class="text"][5]/text()', MapCompose(unicode.strip, unicode.title)) 
            student_info.add_xpath('ProEnglish', '//table/tbody/tr/td[@class="text"][8]/table/tbody/tr/td[2]/text()', MapCompose(unicode.strip, unicode.title)) 

            yield student_info.load_item() 
            del student_info 

    def item_scraped(self, item, spider): 
        # signal handler: called after each item has gone through the pipelines 
        if self.student_count > 0: 
            self.student_count -= 1 
            print "in student_count" 
        elif self.student_info_count > 0: 
            self.student_info_count -= 1 
            print "in student_info_count" 
            # this is the request that never fires 
            return scrapy.Request('http://sdszone1.e-wsi.com/standard/index.jhtml', callback=self.print_student_info) 

    def spider_closed(self, spider): 
        print "SPIDER IS CLOSED" 

And here is the log:

2016-11-19 18:42:36 [scrapy] INFO: Spider opened 
2016-11-19 18:42:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-11-19 18:42:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-11-19 18:42:37 [scrapy] DEBUG: Crawled (404) <GET https://sdszone1.e-wsi.com/robots.txt> (referer: None) 
2016-11-19 18:42:38 [scrapy] DEBUG: Crawled (200) <GET https://sdszone1.e-wsi.com/index.jhtml> (referer: None) 
2016-11-19 18:42:38 [scrapy] DEBUG: Redirecting (meta refresh) to <GET https://sdszone1.e-wsi.com/standard/index.jhtml> from <POST https://sdszone1.e-wsi.com/index.jhtml?_DARGS=/index.jhtml.3&_dynSessConf=4369572730097781326> 
2016-11-19 18:42:38 [scrapy] DEBUG: Redirecting (302) to <GET http://sdszone1.e-wsi.com/standard/index.jhtml> from <GET https://sdszone1.e-wsi.com/standard/index.jhtml> 
2016-11-19 18:42:39 [scrapy] DEBUG: Crawled (200) <GET http://sdszone1.e-wsi.com/standard/index.jhtml> (referer: https://sdszone1.e-wsi.com/index.jhtml) 
2016-11-19 18:42:39 [scrapy] DEBUG: Crawled (200) <GET http://sdszone1.e-wsi.com/standard/followup/studyrecord/studentstudyrecord.jhtml> (referer: http://sdszone1.e-wsi.com/standard/index.jhtml) 
3 
in if 
2016-11-19 18:42:40 [scrapy] DEBUG: Crawled (200) <GET http://sdszone1.e-wsi.com/standard/followup/studyrecord/contracts.jhtml?studentCode=18138> (referer: http://sdszone1.e-wsi.com/standard/followup/studyrecord/studentstudyrecord.jhtml) 
in save_student_info 
[u'E530633464'] 
2016-11-19 18:42:40 [scrapy] DEBUG: Scraped from <200 http://sdszone1.e-wsi.com/standard/followup/studyrecord/contracts.jhtml?studentCode=18138> 

None 
in student_info_count 
2016-11-19 18:42:40 [scrapy] INFO: Closing spider (finished) 
SPIDER IS CLOSED 
2016-11-19 18:42:40 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 3500, 
'downloader/request_count': 7, 
'downloader/request_method_count/GET': 6, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 18150, 
'downloader/response_count': 7, 
'downloader/response_status_count/200': 5, 
'downloader/response_status_count/302': 1, 
'downloader/response_status_count/404': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 11, 19, 9, 42, 40, 192000), 
'item_scraped_count': 1, 
'log_count/DEBUG': 9, 
'log_count/INFO': 7, 
'request_depth_max': 3, 
'response_received_count': 5, 
'scheduler/dequeued': 6, 
'scheduler/dequeued/memory': 6, 
'scheduler/enqueued': 6, 
'scheduler/enqueued/memory': 6, 
'start_time': datetime.datetime(2016, 11, 19, 9, 42, 36, 494000)} 
2016-11-19 18:42:40 [scrapy] INFO: Spider closed (finished) 
Done 
[Finished in 5.6s] 

And this is the pipeline code:

class GetidPipeline(object): 
    def __init__(self): 
        pass 

    def process_item(self, item, spider): 
        print item 

    def __del__(self): 
        pass 

The login seems fine and one page scrape is done..

I do not know what is going on.

Thanks.

Answer


Requests (and items) in Scrapy are only handled by the crawler.engine object, which is why whatever a spider callback method returns (no matter where it is declared) gets processed internally by that object.

That does not happen for signal handlers, pipelines, extensions, middlewares and so on; only for spider callback methods.
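
A minimal sketch of that distinction (the spider and the example.com URLs here are hypothetical, just to show where a returned Request is honored):

import scrapy 

class ExampleSpider(scrapy.Spider): 
    name = "example" 
    start_urls = ["http://example.com"] 

    def parse(self, response): 
        # returned from a spider callback: the engine schedules it 
        yield scrapy.Request("http://example.com/next", callback=self.parse) 

    def item_scraped(self, item, spider): 
        # returned from a signal handler: the engine never looks at the 
        # return value, so this request is silently discarded 
        return scrapy.Request("http://example.com/next", callback=self.parse) 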

So normally, when you want to crawl a site and end up returning an item, you just chain every request, starting from the start_requests method, until the last callback returns the item. That said, you can also force Scrapy to add a request to its engine, like this:

self.crawler.engine.crawl( 
    Request( 
        'http://sdszone1.e-wsi.com/standard/index.jhtml', 
        callback=self.print_student_info, 
    ), 
    spider, 
) 
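
Applied to the spider in the question, the item_scraped handler would hand the request to the engine instead of returning it. A sketch, assuming self.crawler is available (the default from_crawler sets it) and using the same two-argument engine.crawl call as above; the dont_filter=True flag is an addition, not part of the original answer, since the loop revisits the same URL, which the duplicate filter would otherwise drop:

def item_scraped(self, item, spider): 
    if self.student_info_count > 0: 
        self.student_info_count -= 1 
        # hand the request to the engine directly; a value returned 
        # from a signal handler is never scheduled 
        self.crawler.engine.crawl( 
            scrapy.Request( 
                'http://sdszone1.e-wsi.com/standard/index.jhtml', 
                callback=self.print_student_info, 
                dont_filter=True,  # the loop revisits this URL 
            ), 
            spider, 
        ) 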

Sorry.. I don't understand.. how can I modify my code.. –


You should have an edit option on your question, but what is the problem? – eLRuLL