Filtering URLs obtained with BeautifulSoup

I am writing a program that fetches the titles of links from a website, but only for links that point to a particular site.

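So far I can use BeautifulSoup to get a list of the anchor tags on the page (including the href="url" part), and I would like to filter them, preferably with a regular expression.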

The links I want to scrape have the form "http://section.website.com/123456", where 123456 is any six-digit number. I have tried the code below, but nothing gets added to the data list.

import urllib2 
from BeautifulSoup import BeautifulSoup 
import re 

opener = urllib2.build_opener() 
opener.addheaders = [('User-agent', 'Mozilla/5.0')] 

url = ('http://awebsite.com') 

ourUrl = opener.open(url).read() 

soup = BeautifulSoup(ourUrl) 

links = soup.findAll('a') 
data = [] 
for i in links: 
    print i 
for i in links: 
    if "http://section.website.com/\d+" in i: 
     data.append(i.text) 
for entry in data: 
    print entry 

raw_input()


2014-10-09 ACrazyChemist

Is there a specific reason you are using BeautifulSoup 3? It was mothballed years ago, and BeautifulSoup 4 gives you more flexibility here. – 2014-10-09 09:26:01

You can leave the filtering entirely to BeautifulSoup:

links = soup.findAll('a', href=re.compile(r'^http://section\.website\.com/\d{6}$'))

This matches only links whose URL ends in exactly six digits; no other links will be included in the result set.
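To see what such a pattern accepts and rejects, here is a minimal sketch using only the `re` module. The candidate URLs are made up for illustration; only the first one has the format described in the question. BeautifulSoup applies a compiled pattern to the attribute value with `search()`, so the `^` and `$` anchors are what prevent partial matches.

```python
import re

# Pattern for the URL format described in the question:
# the fixed prefix followed by exactly six digits and nothing else.
pattern = re.compile(r'^http://section\.website\.com/\d{6}$')

# Hypothetical href values, made up for illustration.
candidates = [
    'http://section.website.com/123456',       # six digits
    'http://section.website.com/12345',        # too few digits
    'http://section.website.com/1234567',      # too many digits
    'http://section.website.com/123456/edit',  # trailing path
    'http://other.website.com/123456',         # different subdomain
]

# search() is used here to mirror how the pattern is applied to href values.
matching = [href for href in candidates if pattern.search(href)]
print(matching)
```

Only the first candidate survives the filter.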

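Your code fails because you need to test against the href attribute, and because the `in` operator does not use regular expressions, only plain-text matching. The following gets you part of the way to a filter without a regular expression: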

if "http://section.website.com/" in i.get('href', ''):

But this does not test for the digits, nor does it check that the URL actually starts with that text.
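If you do want to avoid regular expressions altogether, the two missing checks can be added with ordinary string methods. This is only a sketch: the helper name `is_section_link` and the constant `PREFIX` are hypothetical, and it is written for Python 3. It also shows why the original `in` test never matched: `\d+` is searched for as literal characters.

```python
PREFIX = 'http://section.website.com/'  # hypothetical constant for this sketch


def is_section_link(href):
    """Return True if href is PREFIX followed by exactly six digits 0-9."""
    if not href or not href.startswith(PREFIX):
        return False
    tail = href[len(PREFIX):]
    return len(tail) == 6 and all(c in '0123456789' for c in tail)


href = 'http://section.website.com/123456'

# `in` does a plain substring search: it looks for a literal backslash, "d"
# and "+", so this is False even though the URL has the desired format.
print(r'http://section.website.com/\d+' in href)

print(is_section_link(href))
print(is_section_link('http://section.website.com/12345'))
```

With BeautifulSoup you would call it as `is_section_link(i.get('href'))` for each anchor tag `i`.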

You may want to upgrade to BeautifulSoup version 4; you are using BeautifulSoup 3, which was mothballed two years ago. All new features and bug fixes go into version 4.
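For reference, a rough sketch of the same filter on BeautifulSoup 4, written for Python 3 and assuming the `beautifulsoup4` package is installed. The HTML snippet is a made-up stand-in so the example runs without a network request; in the question it would be the page fetched with the opener.

```python
import re

from bs4 import BeautifulSoup  # BeautifulSoup 4 lives in the bs4 package

# Made-up stand-in for the downloaded page.
html = """
<a href="http://section.website.com/123456">First title</a>
<a href="http://section.website.com/12">Too short</a>
<a href="http://elsewhere.example.com/123456">Other site</a>
<a>No href at all</a>
"""

# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile(r'^http://section\.website\.com/\d{6}$')

# find_all() is the BeautifulSoup 4 spelling of findAll().
links = soup.find_all('a', href=pattern)

# Collect the link text of every matching anchor.
data = [link.get_text() for link in links]
print(data)
```

Apart from the import, the explicit parser argument and the method name, the filtering call is the same as in version 3.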


2014-10-09 09:27:41
