我是Scrapy的新手,目前我正在嘗試編寫一個CrawlSpider來抓取Tor darknet上的論壇。目前我CrawlSpider代碼:如何使用我的scrapy CrawlSpider將相對路徑轉換爲絕對路徑?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class HiddenAnswersSpider(CrawlSpider):
name = 'ha'
start_urls = ['http://answerstedhctbek.onion/questions']
allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']
rules = (
Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')
)
def makeAbsolutePath(links):
for i in range(links):
links[i] = links[i].replace("../","")
return links
由於論壇使用相對路徑,我試圖創建一個自定義process_links去掉「../」但是當我跑我的代碼我仍然recieving:
2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions)
正如您所見,由於路徑不正確,我仍然收到400個錯誤。爲什麼我的代碼不能從鏈接中刪除「../」?
謝謝!
Aufziehvogel,它終於正常工作,謝謝你!直到我在makeAbsolutePath中添加'self'作爲參數之前,我無法收到上面提到的任何錯誤。因此,添加「自己」,包括您提到的所有其他解決方案已解決了這個問題網址仍然是錯誤的,但我可以簡單地包含鏈接[i] .url = links [i] .url.replace('../','') – ToriTompkins