I'm trying to use urlparse.urljoin in a Scrapy spider to compile a list of URLs. Currently, my spider returns nothing, but also raises no errors, so I'm trying to check whether I'm compiling the URLs correctly. Does Scrapy's urlparse.urljoin behave the same way as str.join?
My attempt was to test with str.join in IDLE, as shown below:
>>> href = ['lphs.asp?id=598&city=london',
'lphs.asp?id=480&city=london',
'lphs.asp?id=1808&city=london',
'lphs.asp?id=1662&city=london',
'lphs.asp?id=502&city=london',]
>>> for x in href:
	base = "http:/www.url-base.com/destination/"
	final_url = str.join(base, x)
	print(final_url)
which returns this, all on one line:
lhttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/hhttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/.http:/www.url-base.com/destination/ahttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/?http:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/5http:/www.url-base.com/destination/9http:/www.url-base.com/destination/8http:/www.url-base.com/destination/&http:/www.url-base.com/destination/chttp:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/thttp:/www.url-base.com/destination/yhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/lhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/nhttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/n
I think it's clear from my example that str.join does not behave the same way - and if that's the case, it would explain why my spider isn't following those links! Still, it would be good to have confirmation.
If this isn't the right way to test it, how can I test this process?
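For reference, str.join treats its first argument as the separator and iterates over the second, and iterating over a string yields its individual characters - which is why the output above repeats the base URL between every character of the link. A minimal sketch (with a short stand-in separator so the output is readable):

```python
# str.join(sep, iterable) is the unbound form of sep.join(iterable).
# Iterating over a string yields its characters, so the separator is
# inserted between every pair of characters - not appended once.
sep = "-"  # stand-in for the long base URL in the question
print(str.join(sep, "abc"))  # → a-b-c
```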
Update Attempted with urlparse.urljoin as follows:
>>> from urllib.parse import urlparse
>>> for x in href:
	base = "http:/www.url-base.com/destination/"
	final_url = urlparse.urljoin(base, x)
	print(final_url)
This throws AttributeError: 'function' object has no attribute 'urljoin'
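The AttributeError arises because the name imported here, urlparse, is a function from urllib.parse (not the Python 2 urlparse module), so it has no urljoin attribute. A minimal sketch of the import that works in Python 3, with example.com as a stand-in base URL:

```python
from urllib.parse import urljoin  # import the function directly

# Hypothetical base URL; note the double slash after "http:"
base = "http://www.example.com/destination/"
link = "lphs.asp?id=598&city=london"

# urljoin resolves the relative link against the base URL
print(urljoin(base, link))
# → http://www.example.com/destination/lphs.asp?id=598&city=london
```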
Update - related:
def parse_links(self, response):
    # insert xpath which contains the href for the rooms
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract()
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
        # This is not joining the final_url right
        yield Request(final_url, callback=parse_links)
Spider function
Update
I just tested again in IDLE:
>>> from urllib.parse import urljoin
>>> from urllib import parse
>>> room_links = ['lphs.asp?id=562&city=london',
'lphs.asp?id=1706&city=london',
'lphs.asp?id=1826&city=london',
'lphs.asp?id=541&city=london',
'lphs.asp?id=1672&city=london',
'lphs.asp?id=509&city=london',
'lphs.asp?id=428&city=london',
'lphs.asp?id=614&city=london',
'lphs.asp?id=336&city=london',
'lphs.asp?id=412&city=london',
'lphs.asp?id=611&city=london',]
>>> for link in room_links:
	base_url = "http:/www.url-base.com/destination/"
	final_url = urlparse.urljoin(base_url, link)
	print(final_url)
which throws this:
Traceback (most recent call last):
File "<pyshell#34>", line 3, in <module>
final_url = urlparse.urljoin(base_url, link)
AttributeError: 'function' object has no attribute 'urljoin'
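The IDLE snippet above imports urljoin directly from urllib.parse, then calls it as urlparse.urljoin, which triggers the same AttributeError. Dropping the urlparse. prefix is enough; a sketch with the same hypothetical base URL:

```python
from urllib.parse import urljoin

room_links = ['lphs.asp?id=562&city=london',
              'lphs.asp?id=1706&city=london']
base_url = "http://www.example.com/destination/"

# Call urljoin directly, not as an attribute of another name
for link in room_links:
    final_url = urljoin(base_url, link)
    print(final_url)
```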
If your 'room_links' is showing good stuff and 'base_url' is set correctly, then that part should be fine... What does the rest of your spider look like... is parse_links actually being called correctly, and does it really need to yield a callback to itself? If it does - once it starts crawling, it looks like it will just keep crawling and never yield any data. Do you have e.g. 'start_requests' or 'start_urls' defined? – JonClements
@JonClements The base URL is set correctly; if I take it and manually append the relative href, it works. I'm using 'start_urls' rather than 'start_requests'. However, I don't think the function is working correctly - see the update for what happens when I run it in IDLE. – Maverick