Retrieve links from a web page using Python and BeautifulSoup

139

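Here's a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags instead of building a tree for the whole page.
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    # BeautifulSoup 3 tags expose has_key(); has_attr() is the BeautifulSoup 4 name.
    if link.has_key('href'):
        print link['href']

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: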

http://www.crummy.com/software/BeautifulSoup/documentation.html

Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.

Source

2009-07-03 18:53:55 ars

+10

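+1, using a filter is a good idea, since it lets you bypass a lot of unnecessary parsing when all you're after are the links. – 2009-07-03 18:57:34

+0

I edited to add a similar explanation before I saw Evan's comment. Thanks for noting that, though! – ars 2009-07-03 19:01:16

+0

Thanks, this solved my problem; I finished my project with it. Thank you very much – NepUS 2009-07-03 21:17:57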

25

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    # a.get() avoids a KeyError on <a> tags that have no href attribute.
    if 'national-park' in a.get('href', ''):
        print 'found a url with national-park in the link'

Source

2009-07-03 18:37:53

4

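Just to get the links, without B.soup or regex:

import urllib2
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            end = item.index(endtag)
        except ValueError:
            # str.index() raises ValueError when the marker is missing; skip this chunk.
            pass
        else:
            print item[:end]

For more complex operations, BSoup is of course still preferred.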
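As a middle ground between manual string splitting and a full BeautifulSoup parse, Python 3's standard-library html.parser module can also collect href values. The sketch below is not part of the original answer; the LinkCollector class name and the sample markup are made up for illustration, and the markup stands in for the downloaded page so that no network access is needed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Sample markup standing in for page.read().
sample_html = """
<p><a href="http://www.somewhere.com/one">one</a>
<A HREF='/two'>two</A>
<a name="anchor-without-href">three</a></p>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # ['http://www.somewhere.com/one', '/two']
```

Unlike the split-based approach, this copes with single-quoted attributes, upper-case tags, and anchors that carry other attributes before href.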

Source

2009-07-04 03:11:21 ghostdog74

44

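Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too, if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something else where anything that isn't pure Python isn't allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and XPath would look like this: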

import urllib 
import lxml.html 
connection = urllib.urlopen('http://www.nytimes.com') 

dom = lxml.html.fromstring(connection.read()) 

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links) 
    print link

Source

2009-08-03 15:34:01 aehlke

2

Why not use regular expressions:

import urllib2 
import re 
url = "http://www.somewhere.com" 
page = urllib2.urlopen(url) 
page = page.read() 
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page) 
for link in links: 
    print('href: %s, HTML text: %s' % (link[0], link[1]))

Source

2012-05-27 01:49:47 ahmadh

6

Under the hood, BeautifulSoup can now use lxml as its parser. Requests, lxml and list comprehensions make a killer combo.

import requests 
import lxml.html 

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content) 

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

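In the list comprehension, the "if '//' in x and 'url.com' not in x" condition is a simple way to scrub the URL list of the site's "internal" navigation URLs and the like.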
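A substring test like this can misfire, for example when another site's URL merely mentions the domain in its query string. As a rough sketch of a stricter variant (not from the original answer), the host can be compared with urllib.parse from the Python 3 standard library. The external_links function name and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def external_links(hrefs, site_host):
    """Return hrefs whose host is neither site_host nor one of its subdomains."""
    result = []
    for href in hrefs:
        host = urlsplit(href).hostname  # None for relative links such as "/section/world"
        if host is None:
            continue
        if host == site_host or host.endswith("." + site_host):
            continue
        result.append(href)
    return result


# Made-up hrefs standing in for dom.xpath('//a/@href').
hrefs = [
    "/section/world",
    "https://www.nytimes.com/pages/sports",
    "//static.nytimes.com/app.js",
    "https://example.org/story?ref=nytimes.com",
    "http://example.com/",
]
print(external_links(hrefs, "nytimes.com"))
# ['https://example.org/story?ref=nytimes.com', 'http://example.com/']
```

The example.org link survives here, whereas the substring check would have dropped it because the text 'nytimes.com' appears in its query string.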

Source

2013-10-07 10:46:27 cheekybastard

8

The following code retrieves all the links available in a web page using urllib2 and BeautifulSoup4:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(html)
for line in soup.find_all('a'):
    print(line.get('href'))

Source

2014-02-07 14:17:08 Sentient07

39

For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:

from bs4 import BeautifulSoup 
import urllib2 

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks") 
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset')) 

for link in soup.find_all('a', href=True): 
    print link['href']

or the Python 3 version:

from bs4 import BeautifulSoup 
import urllib.request 

resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks") 
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset')) 

for link in soup.find_all('a', href=True): 
    print(link['href'])

and a version using the requests library, which as written will work in both Python 2 and 3:

from bs4 import BeautifulSoup 
from bs4.dammit import EncodingDetector 
import requests 

resp = requests.get("http://www.gpsbasecamp.com/national-parks") 
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None 
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True) 
encoding = html_encoding or http_encoding 
soup = BeautifulSoup(resp.content, from_encoding=encoding) 

for link in soup.find_all('a', href=True): 
    print(link['href'])

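The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and can conflict with the <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.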
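The precedence described above (an encoding declared inside the document beats the HTTP header, and no guess is made when neither is present) can be sketched with the Python 3 standard library alone. This is only a simplified stand-in for what EncodingDetector does, not its implementation; the three function names are hypothetical and the <meta> scan is a deliberately rough regular expression.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def charset_from_meta(html_bytes):
    """Rough scan of the first bytes for a charset declared in a <meta> tag."""
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', html_bytes[:1024], re.I)
    return match.group(1).decode("ascii") if match else None


def pick_encoding(content_type, html_bytes):
    # The encoding declared in the document wins over the HTTP header.
    return charset_from_meta(html_bytes) or charset_from_header(content_type)


page = b'<html><head><meta charset="utf-8"></head><body></body></html>'
print(pick_encoding("text/html; charset=ISO-8859-1", page))         # utf-8
print(pick_encoding("text/html; charset=ISO-8859-1", b"<p>x</p>"))  # ISO-8859-1
print(pick_encoding("text/html", b"<p>x</p>"))                      # None
```

Returning None in the last case mirrors the advice above: without an explicit charset, the decision is left to the parser rather than assuming Latin-1.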

Source

2014-03-22 20:52:44

0

import urllib2 
from bs4 import BeautifulSoup 
a=urllib2.urlopen('http://dir.yahoo.com') 
code=a.read() 
soup=BeautifulSoup(code) 
links=soup.findAll("a") 
#To get href part alone 
print links[0].attrs['href']

Source

2014-09-04 19:00:16

3

This script does what you're looking for, but it also resolves the relative links to absolute links.

Source

2015-01-21 21:10:19

5

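To find all the links, this example uses the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() finds the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.

import urllib2
import re

# placeholder address; replace with the page you want to scan
url = "http://www.somewhere.com"

# connect to a URL
website = urllib2.urlopen(url)

# read html code
html = website.read()

# use re.findall to get all the links; the scheme group is non-capturing
# so that findall returns plain strings rather than (url, scheme) tuples
links = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print links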
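To make the difference between re.search() and re.findall() concrete, and to show how capturing groups change what findall returns, here is a small Python 3 sketch that runs on an inline sample string instead of a downloaded page. The sample markup and the variable names are made up for illustration.

```python
import re

html = (
    '<a href="http://a.example/x">A</a> '
    '<a href="https://b.example/y">B</a> '
    '<a href="ftp://c.example/z">C</a>'
)

# re.search stops at the first match.
first = re.search(r'"((?:http|ftp)s?://.*?)"', html).group(1)

# With exactly one capturing group, findall returns a list of strings.
all_urls = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

# With two capturing groups, findall returns one tuple per match.
with_scheme_group = re.findall(r'"((http|ftp)s?://.*?)"', html)

print(first)              # http://a.example/x
print(all_urls)           # ['http://a.example/x', 'https://b.example/y', 'ftp://c.example/z']
print(with_scheme_group)  # [('http://a.example/x', 'http'), ('https://b.example/y', 'http'), ('ftp://c.example/z', 'ftp')]
```

Writing the scheme alternation as (?:http|ftp) keeps the result a flat list of URL strings.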

Source

2015-08-06 03:22:40

1

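BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).

import lxml.html

doc = lxml.html.parse(url)

links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']

The code above returns the links as is, and in most cases they will be relative links or absolute from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in relative paths, but so far I haven't needed that. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.

NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't follow redirects, so for this reason the version below uses urllib2 + lxml.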

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`\n')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    # optional second argument: glob pattern, defaults to matching everything
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            # note the comma after 'https://'; without it the two
            # strings are concatenated into 'https://ftp://'
            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

                    if urltools:
                        href = urltools.normalize(href)

            print href

Usage is as follows:

getlinks.py http://stackoverflow.com/a/37758066/191246 
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*" 
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
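For Python 3, the core of the script (filter by glob pattern, then make each link absolute) can be sketched with the standard library alone. This is not a port of the author's script: filter_and_resolve is a hypothetical helper, the href list stands in for the lxml XPath results, and each link is joined against the page URL itself so that urljoin takes care of ../ and root-relative paths.

```python
import fnmatch
from urllib.parse import urljoin


def filter_and_resolve(page_url, hrefs, glob_patt="*"):
    """Keep hrefs matching glob_patt and turn them into absolute URLs."""
    return [
        urljoin(page_url, href)
        for href in hrefs
        if fnmatch.fnmatch(href, glob_patt)
    ]


# Made-up hrefs standing in for link.attrib['href'] values.
hrefs = [
    "song.mp3",
    "/media/intro.mp3",
    "../other/clip.mp3",
    "about.html",
    "http://cdn.example.net/live.mp3",
]

for full_url in filter_and_resolve("http://fakedomain.mu/music/somepage.html", hrefs, "*.mp3"):
    print(full_url)
```

With the sample input this prints four URLs: the three relative .mp3 links resolved under fakedomain.mu and the already absolute cdn.example.net link unchanged, while about.html is filtered out.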

Source

2016-06-10 22:38:00 ccpizza

0

Here's an example that uses @ars's accepted answer together with the BeautifulSoup4, requests, and wget modules to handle the downloads.

import requests 
import wget 
import os 

from bs4 import BeautifulSoup, SoupStrainer 

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/' 
file_type = '.tar.gz' 

response = requests.get(url) 

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

Source

2016-07-11 18:58:08 Blairg23

0

I found that @Blairg23's answer worked after the following correction (it covers the scenario where the original failed to work correctly):

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)

For Python 3:

urllib.parse.urljoin has to be used to obtain the full URL instead.
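A quick way to see why urljoin is safer than plain concatenation is to run both on a few kinds of href. The Python 3 sketch below uses a made-up base URL and made-up file names; only urllib.parse.urljoin itself comes from the standard library.

```python
from urllib.parse import urljoin

# Hypothetical directory listing URL and hrefs, for illustration only.
url = "https://archive.example.org/datasets/eeg_full/"
hrefs = [
    "co2a0000364.tar.gz",                # relative to the listing
    "/mirror/co2c0000337.tar.gz",        # relative to the site root
    "https://cdn.example.org/x.tar.gz",  # already absolute
]

concatenated = [url + href for href in hrefs]
joined = [urljoin(url, href) for href in hrefs]

for bad, good in zip(concatenated, joined):
    print(bad, "->", good)
```

Concatenation only gives a usable URL for the first href. For the other two it glues the base onto a path or a full URL, while urljoin resolves them to https://archive.example.org/mirror/co2c0000337.tar.gz and https://cdn.example.org/x.tar.gz.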

Source

2017-05-25 16:03:12
