BeautifulSoup returns a shortened URL for a page on the same website

My code, for reference:

import httplib2
from bs4 import BeautifulSoup

# Cache responses on disk in the '.cache' directory
h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")

# Collect the href of every anchor tag that has one
urls = []
for tag in soup.find_all('a', href=True):
    urls.append(tag['href'])

# Fetch each collected link, skipping any request that fails
responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except Exception:
        pass

The idea is that I fetch a web page's payload and then scrape it for hyperlinks. One of the links points to yahoo.com, and the other to "http://csb.stanford.edu/class/public/index.html".

However, the result I get from BeautifulSoup is:

>>> urls 
['http://www.yahoo.com/', '../../index.html']

This is a problem, because the second part of the script cannot run on the second, shortened URL. Is there a way to make BeautifulSoup retrieve the full URL?

2017-05-03 Joseph O' Connell

That is because the link on the page really is written in that form. The HTML from the page is:

<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>

This is called a relative link.

To convert it to an absolute link, you can use urljoin from the standard library.

from urllib.parse import urljoin  # Python 3

urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
        '../../index.html')
# returns http://csb.stanford.edu/class/public/index.html
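
Applied to the whole list of scraped links, the same call might look like the sketch below. The helper name make_absolute and the constant BASE_URL are made up for illustration and are not part of BeautifulSoup or httplib2. The list urls is hard-coded to the values shown in the question so that the snippet runs without a network connection.

```python
from urllib.parse import urljoin

# The page the links were scraped from (taken from the question).
BASE_URL = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'


def make_absolute(base_url, hrefs):
    """Resolve every href against base_url.

    Hypothetical helper: hrefs that are already absolute come back
    unchanged, relative ones are resolved against the base page.
    """
    return [urljoin(base_url, href) for href in hrefs]


# The href values BeautifulSoup returned in the question.
urls = ['http://www.yahoo.com/', '../../index.html']

absolute_urls = make_absolute(BASE_URL, urls)
print(absolute_urls)
# ['http://www.yahoo.com/', 'http://csb.stanford.edu/class/public/index.html']
```

Because urljoin leaves absolute URLs untouched, there should be no need to test each link before resolving it. The resulting list can be fed directly to the request loop in the question.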

2017-05-03 17:54:24 MinchinWeb

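Of course, thank you. I might put the URL-joining part inside the exception handler.

On further thought, I won't bother. This is only for a single web page, so it isn't worth the trouble.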
