2016-05-17 92 views

How to extract URLs matching a pattern

I am trying to extract URLs from a web page that match the following pattern:

http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html

My current code extracts all links. How can I change it to extract only the URLs that match the pattern? Thank you!

import requests
from bs4 import BeautifulSoup

def find_governor_races(url):
    # Download the page and parse it.
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    links = []
    # Collect the href of every <a> tag that has one.
    for a in soup.find_all('a', href=True):
        links.append(a['href'])
    return links

links = find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')

Answers


You can pass a compiled regular expression pattern to .find_all() as the value of the href argument:

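import re

# Escape the dots so they match a literal "." rather than any character.
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
# find_all returns the matching <a> tags; read a["href"] to get each URL.
links = soup.find_all("a", href=pattern)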
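
To try this without hitting the network, here is a minimal sketch that runs the same idea against a small inline HTML string. The sample markup, the GOVERNOR_RACE name, and the stricter pattern (a four-digit year, a two-letter lowercase state code, and one final path segment) are my own assumptions based on the ???? and ?? placeholders in the question, not something taken from the real page:

```python
import re
from bs4 import BeautifulSoup

# Assumed reading of the question's wildcards: ???? = four digits,
# ?? = two lowercase letters, then a single file name ending in .html.
GOVERNOR_RACE = re.compile(
    r"^http://www\.realclearpolitics\.com/epolls/\d{4}/governor/[a-z]{2}/[^/]+\.html$"
)

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<a href="http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html">CA</a>
<a href="http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html">Map</a>
<a href="http://www.realclearpolitics.com/epolls/2010/senate/ca/california_senate_boxer_vs_fiorina-1094.html">Senate</a>
<a href="/about.html">About</a>
<a name="top">No href</a>
"""

def find_governor_races(html_text):
    """Return the href of every <a> tag whose href matches GOVERNOR_RACE."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Beautiful Soup applies the compiled pattern with search(), so the
    # ^ and $ anchors are what force a match on the whole href.
    return [a["href"] for a in soup.find_all("a", href=GOVERNOR_RACE)]

races = find_governor_races(SAMPLE_HTML)
print(races)
```

Note that this only matches absolute URLs. If the real page uses relative hrefs, the host part of the pattern would need to be made optional or the links resolved first.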

Thank you so much. This really helps. – user6283465