Scrapy LxmlLinkExtractor and relative URLs

The correct URL, the one I should end up with, is: http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html

Scrapy gets this relative URL from the page source:

<a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>

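But then the crawl fails, because the "../../" is treated as a literal part of the next URL...

Should I convert the doubly relative URL I get from LxmlLinkExtractor with a custom process_value?

Does Scrapy handle relative URLs correctly? I mean, is this its intended behavior?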

2014-12-06 17:20:05+0100 [togspider] DEBUG: Crawled (200) <GET http://www.lecture-en-ligne.com/manga/towerofgod/> (referer: None)

2014-12-06 17:20:05+0100 [togspider] DEBUG: Retrying <GET http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html> (failed 1 times): 400 Bad Request

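# Import paths for Scrapy 1.0+; older releases used scrapy.contrib.*
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class TogSpider(CrawlSpider):
    name = "togspider"
    allowed_domains = ["lecture-en-ligne.com"]
    start_urls = ["http://www.lecture-en-ligne.com/manga/towerofgod/"]

    # parse_chapter is not shown in the original post
    rules = (
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a'),
             callback='parse_chapter'),
    )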

2014-12-06 euri10

http://stackoverflow.com/a/19773661/3581357 gives an answer. I implemented it this way, but I still wonder whether this is intended :) def process_links(links): links = re.sub(r'\.\.\/', '', links); return links – euri10 2014-12-06 16:43:39
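The workaround described in the comment above could be sketched roughly as follows. It is written as a process_value hook, the link-extractor parameter the question mentions, because process_links on a Rule receives a list of Link objects rather than a string. The helper name strip_parent_refs is hypothetical, and whether the hook sees the raw href or an already joined URL may depend on the Scrapy version, so the sketch tries to handle both.

```python
import re

# Matches a run of "../" at the start of a value or right after a slash.
_PARENT_REFS = re.compile(r'(^|/)(?:\.\./)+')


def strip_parent_refs(value):
    """Drop runs of '../' from an href or from an already joined URL.

    Deliberately blunt: it does not resolve '..' against the preceding
    path segment, so it only suits pages like this one, where the '..'
    steps are surplus relative to the declared base URL.
    """
    return _PARENT_REFS.sub(r'\1', value)


# Assumed wiring (needs Scrapy, so it is left as a comment here):
#   LxmlLinkExtractor(allow_domains=allowed_domains,
#                     restrict_xpaths=...,
#                     process_value=strip_parent_refs)

print(strip_parent_refs(
    "http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html"))
print(strip_parent_refs("../../towerofgod/168/0/0/1.html"))
```

Anchoring the pattern to the start of the value or to a slash keeps it from touching a path segment that merely ends in two dots.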

The problem is that the page has an incorrect HTML base element, which is supposed to specify the base URL for all relative links on the page:

<base href="http://www.lecture-en-ligne.com/"/>

Scrapy respects this, which is why the links are being formed that way.
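To illustrate the effect of the base element, here is a minimal sketch using only Python's standard urljoin, not Scrapy itself. It resolves the href from the question against the page's own URL and against the declared base. The constant names are chosen for illustration only.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.lecture-en-ligne.com/manga/towerofgod/"
BASE_HREF = "http://www.lecture-en-ligne.com/"  # value of the <base> element
HREF = "../../towerofgod/168/0/0/1.html"

# Against the page's own URL, the two ".." steps climb out of
# /manga/towerofgod/ and give the URL the question expects.
from_page = urljoin(PAGE_URL, HREF)

# Against the declared base there are no directories left to climb out of,
# so both ".." steps are surplus.
from_base = urljoin(BASE_HREF, HREF)

print(from_page)
print(from_base)
```

Recent Python versions follow RFC 3986 and silently discard the surplus ".." steps in the second case. The 400 response in the log above suggests that the URL joining in use at the time kept them in the path instead.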

2014-12-06 17:31:06 elias

Thank you for the clear explanation – euri10 2014-12-06 17:57:33

@euri10 You're welcome! Sorry there is no better workaround than the one you already found. – elias 2014-12-06 17:59:04
