How to extract URLs matching a pattern

I am trying to extract the URLs on a web page that match the following pattern:

'http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html'

My current code extracts all links. How can I change it so that it only extracts URLs matching the pattern? Thanks!

import requests
from bs4 import BeautifulSoup

def find_governor_races(url):
    base_url = 'http://www.realclearpolitics.com/'  # not used yet
    # Download the page and parse it
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # Collect the href of every <a> tag (no filtering yet)
    links = []
    for a in soup.find_all('a', href=True):
        links.append(a['href'])
    return links

find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')


2016-05-17 user6283465

You can pass a compiled regular expression pattern to .find_all() as the value of the href argument:

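import re

# Escape the dots so they match a literal "." rather than any character
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
# links holds the matching <a> tags; use a["href"] to get each URL
links = soup.find_all("a", href=pattern)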
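
To see what a pattern of this shape accepts and rejects, here is a minimal sketch that filters a plain list of href strings using only the standard library. The names BASE_URL, GOVERNOR_RACE and filter_governor_links, as well as the sample hrefs, are made up for illustration. The sketch also assumes the year is four digits, the state segment is two characters, and that relative hrefs should be resolved against the site root.

```python
import re
from urllib.parse import urljoin

# Assumption: relative hrefs are resolved against the site root.
BASE_URL = "http://www.realclearpolitics.com/"

# Dots are escaped so they match literal dots, and [^/] keeps each
# wildcard inside a single path segment.
GOVERNOR_RACE = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d{4}/governor/[^/]{2}/[^/]+\.html"
)

def filter_governor_links(hrefs, base_url=BASE_URL):
    """Return the hrefs that match the governor-race pattern, as absolute URLs."""
    matches = []
    for href in hrefs:
        # Relative hrefs become absolute; absolute ones pass through unchanged.
        absolute = urljoin(base_url, href)
        # fullmatch requires the whole URL to fit the pattern.
        if GOVERNOR_RACE.fullmatch(absolute):
            matches.append(absolute)
    return matches

# Hypothetical hrefs, for illustration only.
sample_hrefs = [
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race-1.html",
    "/epolls/2010/governor/ny/another_race-2.html",
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    "http://www.realclearpolitics.com/epolls/2010/senate/ca/example_race-3.html",
]
governor_links = filter_governor_links(sample_hrefs)
print(governor_links)
```

Only the first two sample hrefs are kept: the map page has no state segment and the last one is a senate race. Using fullmatch here is a design choice; when a compiled pattern is passed to find_all(), Beautiful Soup uses the pattern's search() method instead, so the pattern may match anywhere in the href unless it is anchored.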


2016-05-17 20:08:11 alecxe

Thank you so much. This really helped. – user6283465
