2017-11-11 569 views

How can I convert relative paths into absolute paths with my Scrapy CrawlSpider?

I am new to Scrapy, and I am currently trying to write a CrawlSpider that crawls a forum on the Tor darknet. This is my CrawlSpider code so far:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

class HiddenAnswersSpider(CrawlSpider): 
    name = 'ha' 
    start_urls = ['http://answerstedhctbek.onion/questions'] 
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion'] 
    rules = (
      Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'), 
      Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath') 

      ) 

def makeAbsolutePath(links): 
    for i in range(links): 
      links[i] = links[i].replace("../","") 
    return links 

Because the forum uses relative paths, I tried to create a custom process_links callback that strips the "../". But when I run my code, I am still receiving:

2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions) 
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions) 
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions) 
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions) 

As you can see, I am still getting 400 errors because the paths are wrong. Why doesn't my code remove the "../" from the links?

Thanks!

Answer


The problem is probably that makeAbsolutePath is not part of the spider class. The documentation states:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used)

You are not using self in makeAbsolutePath, so I assume it is not just an indentation error. makeAbsolutePath also has some other bugs. If we correct the code to this state:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 


class HiddenAnswersSpider(CrawlSpider): 
    name = 'ha' 
    start_urls = ['file:///home/user/testscrapy/test.html'] 
    allowed_domains = [] 
    rules = (
        Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'), 
    ) 

    def makeAbsolutePath(self, links): 
        print(links) 
        for i in range(links): 
            links[i] = links[i].replace("../", "") 
        return links 

it produces this error:

TypeError: 'list' object cannot be interpreted as an integer 

This is because range is being called on the list itself instead of on len(links), and range can only operate on integers: it expects a number and gives you the range from 0 to that number minus 1.
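The difference can be demonstrated in isolation (a minimal sketch, independent of scrapy):

```python
links = ["a", "b", "c"]

# Wrong: passing the list itself raises
# "TypeError: 'list' object cannot be interpreted as an integer"
try:
    range(links)
except TypeError as e:
    print(e)

# Correct: range over the list's length yields the indices 0..len-1
indices = list(range(len(links)))
print(indices)  # [0, 1, 2]
```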

After fixing this issue, it gives the error:

AttributeError: 'Link' object has no attribute 'replace' 

This is because, unlike you assumed, links is not a list of strings containing the contents of the href="" attributes. Instead, it is a list of Link objects.

I suggest you print the contents of links inside makeAbsolutePath and see whether you have to do anything at all. In my opinion, even though the site uses the .. operator without an actual folder level (because the URL is /questions and not /questions/), scrapy should already stop resolving the .. operator once it reaches the domain level, so your links should end up pointing to http://answerstedhctbek.onion/<number>/<title>.
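How standard URL resolution treats ".." at the domain level can be checked with Python's stdlib urljoin (a sketch; whether scrapy resolves the extracted links this way depends on how they are extracted):

```python
from urllib.parse import urljoin

base = "http://answerstedhctbek.onion/questions"

# With no trailing slash, ".." climbs out of the path and stops at the
# domain root instead of producing a literal "/../" segment:
absolute = urljoin(base, "../badges")
print(absolute)  # http://answerstedhctbek.onion/badges
print(urljoin(base, "../questions?sort=hot"))
```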

Something like this:

def makeAbsolutePath(self, links): 
    for i in range(len(links)): 
        print(links[i].url) 

    return [] 

(Returning an empty list here has the advantage that the spider will stop, so you can inspect the console output.)

If you then find that the URLs are actually wrong, you can do some work on them through the url attribute:

links[i].url = 'http://example.com' 
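Putting the fixes together, a corrected callback might look like this sketch. SimpleLink is a hypothetical stand-in for scrapy's Link object (which likewise exposes a url attribute), used here only so the example runs without scrapy:

```python
class SimpleLink:
    """Hypothetical stand-in for scrapy.link.Link: just carries a url attribute."""
    def __init__(self, url):
        self.url = url


def makeAbsolutePath(links):
    # Rewrite the url attribute of each Link object (not the object itself),
    # and use range(len(...)) so that range receives an integer.
    for i in range(len(links)):
        links[i].url = links[i].url.replace("../", "")
    return links


fixed = makeAbsolutePath([SimpleLink("http://answerstedhctbek.onion/../badges")])
print(fixed[0].url)  # http://answerstedhctbek.onion/badges
```

Inside the spider class, the method would additionally take self as its first parameter, as discussed above.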

Aufziehvogel, it finally works correctly, thank you! I could not reproduce any of the errors you mentioned until I added self as a parameter to makeAbsolutePath. So adding self, together with all the other fixes you mentioned, solved the problem. The URLs were still wrong, but I could simply include links[i].url = links[i].url.replace('../', '') – ToriTompkins