Can't log in using scrapy

I'm trying to scrape a page that I first have to log in to, but for some reason Scrapy crawls another, unrelated page after the FormRequest. See my code below:

# coding: utf-8
import scrapy
from scrapy.http import Request, FormRequest

usuario = 'myemail'
senha = 'mypassword'
urllogin = 'https://ludopedia.com.br/login'
urlnotificacoes = 'https://ludopedia.com.br/notificacoes'


class notificacao(scrapy.Item):
    """Holds the data of the Ludopedia listings"""
    jogo = scrapy.Field()
    colecao = scrapy.Field()
    tipo = scrapy.Field()
    link = scrapy.Field()


class LoginSpider(scrapy.Spider):
    name = 'ludopedia'

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'LOG_LEVEL': 'DEBUG',
    }
    start_urls = [urllogin]

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='form',
            formid='form',
            formdata={'email': usuario, 'pass': senha},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        # check login succeeded before going on
        if "Minha Conta" in response.body:
            self.logger.error("Login falhou")
            return

        yield Request(urlnotificacoes)

        self.logger.info("Visitei %s", response.url)
        msg = response.selector.xpath('//*[@id="page-content"]/div/div/table/tbody/tr[2]/td/a/div[2]/div')
        ...

The output of this script is:

2017-07-25 12:02:55 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot) 
2017-07-25 12:02:55 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True} 
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.memusage.MemoryUsage', 
'scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-07-25 12:02:56 [scrapy.core.engine] INFO: Spider opened 
2017-07-25 12:02:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-07-25 12:02:56 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024 
2017-07-25 12:02:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/login> (referer: None) 
2017-07-25 12:02:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ludopedia.com.br/login> (referer: https://ludopedia.com.br/login) 
2017-07-25 12:02:59 [ludopedia] INFO: Visitei https://ludopedia.com.br/login 
<200 https://ludopedia.com.br/login> 
2017-07-25 12:03:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/notificacoes> (referer: https://ludopedia.com.br/login) 
2017-07-25 12:03:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword> (referer: https://ludopedia.com.br/notificacoes) 
2017-07-25 12:03:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://ludopedia.com.br/notificacoes> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2017-07-25 12:03:01 [ludopedia] INFO: Visitei https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword 
<200 https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword> 
2017-07-25 12:03:01 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-07-25 12:03:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1357, 
'downloader/request_count': 4, 
'downloader/request_method_count/GET': 3, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 134813, 
'downloader/response_count': 4, 
'downloader/response_status_count/200': 4, 
'dupefilter/filtered': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 7, 25, 15, 3, 1, 355077), 
'log_count/DEBUG': 6, 
'log_count/INFO': 9, 
'memusage/max': 51732480, 
'memusage/startup': 51732480, 
'request_depth_max': 4, 
'response_received_count': 4, 
'scheduler/dequeued': 4, 
'scheduler/dequeued/memory': 4, 
'scheduler/enqueued': 4, 
'scheduler/enqueued/memory': 4, 
'start_time': datetime.datetime(2017, 7, 25, 15, 2, 56, 35121)} 
2017-07-25 12:03:01 [scrapy.core.engine] INFO: Spider closed (finished) 

So the problem is that, for some reason, I end up being redirected to ludopedia.com.br/search?search=&email=myemail&pass=mypassword, but I don't know why.

What I want to do is visit ludopedia.com.br/login, fill in the form with the email and password, then visit ludopedia.com.br/notificacoes and parse its HTML.

How can I avoid the request to ludopedia.com.br/search?search=&email=myemail&pass=mypassword?


If it's redirecting from the login submission, I don't think you can avoid it. – T4rk1n


That's a shame, because I can do this with curl: curl https://ludopedia.com.br/login -d email='email' -d pass='password' -c cookie to log in, and curl https://ludopedia.com.br/notificacoes -b cookie to read the page. – carnedepassaro
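
For reference, the cookie-based flow described in this comment could be reproduced in Python with the third-party requests library, whose Session object carries the login cookie between calls much like curl's -c/-b options. This is only a rough sketch: the field names email and pass are taken from the curl example above, and nothing here has been verified against the site.

# Sketch of the curl workflow from the comment, using requests.Session to
# carry the login cookie between the two calls (equivalent of curl -c / -b).
import requests

session = requests.Session()

# Log in: equivalent of  curl .../login -d email=... -d pass=... -c cookie
session.post('https://ludopedia.com.br/login',
             data={'email': 'myemail', 'pass': 'mypassword'})

# Read the page with the stored cookie: equivalent of  curl .../notificacoes -b cookie
page = session.get('https://ludopedia.com.br/notificacoes')
print(page.status_code)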

Answer


I got it working! I think it was a logic problem; here is my working code:

# coding: utf-8
import scrapy
from scrapy.http import Request, FormRequest

usuario = 'myemail'
senha = 'mypassword'
urllogin = 'https://ludopedia.com.br/login'
urlnotificacoes = 'https://ludopedia.com.br/notificacoes'


class notificacao(scrapy.Item):
    """Holds the data of the Ludopedia listings"""
    jogo = scrapy.Field()
    colecao = scrapy.Field()
    tipo = scrapy.Field()
    link = scrapy.Field()


class LoginSpider(scrapy.Spider):
    name = 'ludopedia'

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'LOG_LEVEL': 'DEBUG',
    }
    start_urls = [urllogin]

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='form',
            formid='form',
            formdata={'email': usuario, 'pass': senha},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        # check login succeeded before going on
        if "Minha Conta" in response.body:
            self.logger.error("Login falhou")
            return

        request = Request(urlnotificacoes, callback=self.parse_notificacoes)
        yield request

    def parse_notificacoes(self, response):
        msg = response.selector.xpath('//*[@id="page-content"]/div/div/table/tbody/tr[2]/td/a/div[2]/div')
        ...

The difference is that in after_login I now create the Request for the page I want to scrape with a callback to another function that parses the new response, and I yield that request; the page is then parsed in the new parse_notificacoes function.
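
The explicit callback appears to be what matters: a scrapy.Request created without callback= is handed to the spider's default parse() method, so in the original code the /notificacoes response went back into parse(), which ran FormRequest.from_response again on that page and filled the credentials into its first form (the site search form), which would explain the GET to /search?search=&email=...&pass=... in the log above. Below is a minimal standalone sketch of that default-callback behaviour; the spider name is a placeholder and the URLs are reused from the question purely for illustration.

# Illustration of Scrapy's default-callback behaviour: a Request without
# callback= is delivered to parse(), while an explicit callback routes the
# response to the named method instead.
import scrapy
from scrapy.http import Request

class CallbackDemo(scrapy.Spider):
    name = 'callback_demo'
    start_urls = ['https://ludopedia.com.br/login']

    def parse(self, response):
        # Every response without an explicit callback ends up here, including
        # the start_urls response and any request written as plain Request(url).
        self.logger.info("parse() got %s", response.url)
        yield Request('https://ludopedia.com.br/notificacoes',
                      callback=self.parse_notificacoes)

    def parse_notificacoes(self, response):
        # Reached only because of the explicit callback above.
        self.logger.info("parse_notificacoes() got %s", response.url)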