Trying to grab absolute links from a web page using BeautifulSoup

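I'm using BeautifulSoup to read the contents of a web page. All I want is to grab the <a href> values that start with http://. I know that in BeautifulSoup you can search by attribute, so I guess I just have a syntax problem. I would imagine it looks something like this: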

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.linkpages.com")
soup = BeautifulSoup(page)
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print link

But it returns:

Traceback (most recent call last): 
    File "<stdin>", line 2, in <module> 
    File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__ 
    return self._getAttrMap()[key] 
KeyError: 'href'

Any ideas? Thanks in advance.

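EDIT: This isn't for any site in particular. The script gets the URL from the user, so internal link targets will be an issue, which is also why I only want the <a> tags from the page. If I point it at www.reddit.com, it parses the first few links and then gets to this: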

<a href="http://www.reddit.com/top/">top</a> 
<a href="http://www.reddit.com/saved/">saved</a> 
Traceback (most recent call last): 
    File "<stdin>", line 2, in <module> 
    File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__ 
    return self._getAttrMap()[key] 
KeyError: 'href'

Source

2010-03-23 Kevin

reddit.com has this: […]. So it's not a syntax error, it's the API. – 2010-03-23 18:48:22

from BeautifulSoup import BeautifulSoup 
import re 
import urllib2 

page = urllib2.urlopen("http://www.linkpages.com") 
soup = BeautifulSoup(page) 
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}): 
    print link
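
The pattern in that call matches only hrefs that begin with http://, so https:// links are skipped. Below is a minimal sketch of a slightly broader pattern, written for Python 3 and checked against plain strings rather than a live page. The sample hrefs and the name ABSOLUTE are made up for illustration, and treating https as absolute is an assumption the question does not state.

```python
import re

# Assumption: https links should count as absolute too.
ABSOLUTE = re.compile(r"^https?://")

# Hypothetical href values of the kind a page might contain.
hrefs = [
    "http://www.reddit.com/top/",
    "https://www.reddit.com/saved/",
    "/relative/path",
    "#fragment",
]

# Keep only the values that start with http:// or https://
absolute = [h for h in hrefs if ABSOLUTE.match(h)]
print(absolute)
```

The same compiled pattern could be passed as the 'href' value in the attrs dictionary in place of re.compile("^http://").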

Source

2010-03-23 17:38:05

Perhaps you have some <a> tags without href attributes? Internal link targets, maybe?
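
The traceback in the question shows the lookup going through the tag's attribute map, which behaves like a dictionary: indexing a missing key raises KeyError. BeautifulSoup tags also offer a get() method that returns None instead. Here is a minimal Python 3 sketch of that idea, using plain dicts as stand-ins for tag attributes so it runs without fetching a page. The sample data and the function name absolute_links are hypothetical.

```python
# Plain dicts standing in for the attribute maps of three <a> tags.
anchors = [
    {"href": "http://www.reddit.com/top/"},
    {"name": "section"},          # hypothetical link target, no href
    {"href": "/relative/path"},
]

def absolute_links(attr_maps, prefix="http://"):
    """Collect href values that start with prefix, skipping tags without href."""
    result = []
    for attrs in attr_maps:
        href = attrs.get("href")  # None instead of KeyError when missing
        if href is not None and href.startswith(prefix):
            result.append(href)
    return result

print(absolute_links(anchors))
```

With real tags, the equivalent change to the question's loop would be to call link.get('href') and check for None before calling startswith.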

Source

2010-03-23 17:25:14

Please give us an idea of what you're parsing here. As Andrew points out, it seems likely that some anchor tags have no associated href.

If you really want to ignore them, you could wrap the lookup in a try block and then catch the error with

except KeyError: pass

but that has its own issues.
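
A minimal Python 3 sketch of the try/except approach follows, again with plain dicts standing in for tags so that it runs on its own. The helper name href_or_none and the sample data are hypothetical. Only the lookup sits inside the try block; one likely issue with wrapping the whole loop body is that a KeyError raised by unrelated code would be silently swallowed as well.

```python
def href_or_none(tag):
    """Return tag['href'], or None if the tag has no href attribute."""
    try:
        return tag["href"]
    except KeyError:
        return None

# Plain dicts standing in for two <a> tags; the second has no href.
tags = [{"href": "http://www.reddit.com/top/"}, {"name": "anchor-target"}]

links = []
for tag in tags:
    href = href_or_none(tag)
    if href is not None and href.startswith("http://"):
        links.append(href)

print(links)
```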

Source

2010-03-23 17:32:14

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.redit.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

# Write the href of every <a> tag that has one, one per line.
with open('Links.txt', 'w') as f:
    for item in soup.find_all('a'):
        if 'href' in item.attrs:
            f.write(item.attrs['href'] + ',\n')

A less efficient solution.
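
That loop writes every href it finds, relative ones included. If only absolute links are wanted, as in the question, one possible filter is sketched below for Python 3 using the standard library's urllib.parse. The function name is_absolute is hypothetical, and treating only http and https URLs with a host as absolute is an assumption.

```python
from urllib.parse import urlsplit

def is_absolute(href):
    """True for http(s) URLs that include a host, False otherwise."""
    parts = urlsplit(href)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Hypothetical href values for illustration.
samples = ["http://www.reddit.com/top/", "/top/", "#comments", "mailto:a@b.c"]
kept = [h for h in samples if is_absolute(h)]
print(kept)
```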

Source

2013-02-16 00:16:09 Alex
