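Unable to log in with Scrapy

I am trying to scrape a page that requires logging in first, but for some reason Scrapy crawls another, unrelated page after the FormRequest. See my code below: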
# coding: utf-8
import scrapy
from scrapy.http import Request, FormRequest

usuario = 'myemail'
senha = 'mypassword'
urllogin = 'https://ludopedia.com.br/login'
urlnotificacoes = 'https://ludopedia.com.br/notificacoes'


class notificacao(scrapy.Item):
    """Contem os dados dos Anuncios da ludopedia"""
    jogo = scrapy.Field()
    colecao = scrapy.Field()
    tipo = scrapy.Field()
    link = scrapy.Field()


class LoginSpider(scrapy.Spider):
    name = 'ludopedia'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'LOG_LEVEL': 'DEBUG',
    }
    start_urls = [urllogin]

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='form',
            formid='form',
            formdata={'email': usuario, 'pass': senha},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        # check login succeed before going on
        if "Minha Conta" in response.body:
            self.logger.error("Login falhou")
            return
        yield Request(urlnotificacoes)
        self.logger.info("Visitei %s", response.url)
        msg = response.selector.xpath('//*[@id="page-content"]/div/div/table/tbody/tr[2]/td/a/div[2]/div')
        ...
The output of this script is:
2017-07-25 12:02:55 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-07-25 12:02:55 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-25 12:02:56 [scrapy.core.engine] INFO: Spider opened
2017-07-25 12:02:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-25 12:02:56 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-07-25 12:02:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/login> (referer: None)
2017-07-25 12:02:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ludopedia.com.br/login> (referer: https://ludopedia.com.br/login)
2017-07-25 12:02:59 [ludopedia] INFO: Visitei https://ludopedia.com.br/login
<200 https://ludopedia.com.br/login>
2017-07-25 12:03:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/notificacoes> (referer: https://ludopedia.com.br/login)
2017-07-25 12:03:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword> (referer: https://ludopedia.com.br/notificacoes)
2017-07-25 12:03:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://ludopedia.com.br/notificacoes> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-25 12:03:01 [ludopedia] INFO: Visitei https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword
<200 https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword>
2017-07-25 12:03:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-25 12:03:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1357,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 134813,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 25, 15, 3, 1, 355077),
'log_count/DEBUG': 6,
'log_count/INFO': 9,
'memusage/max': 51732480,
'memusage/startup': 51732480,
'request_depth_max': 4,
'response_received_count': 4,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2017, 7, 25, 15, 2, 56, 35121)}
2017-07-25 12:03:01 [scrapy.core.engine] INFO: Spider closed (finished)
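So the problem is that, for some reason, I keep getting redirected to ludopedia.com.br/search?search=&email=myemail&pass=mypassword, and I don't know why.
What I want to do is visit ludopedia.com.br/login, fill in the form with my email and password, then visit ludopedia.com.br/notificacoes and parse the HTML.
How can I avoid the ludopedia.com.br/search?search=&email=myemail&pass=mypassword link?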
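If it is redirecting from the login submission, I don't think you can avoid it. – T4rk1n
That's a shame, because I can achieve this with curl: curl https://ludopedia.com.br/login -d email='email' -d pass='password' -c cookie to log in, and curl https://ludopedia.com.br/notificacoes -b cookie to read the page. – carnedepassaro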