Extracting a URL from page source with Python 3

-1

My question is related to this one: How to extract URL from HTML anchor element using Python3?

What if I don't know the exact URL and only have a keyword that should appear in it? How can I extract the matching URL from the page source?

2015-02-06 cranberry

Uh... extract all of them and check each one in turn. – 2015-02-06 07:04:41

Try using a regular expression:

import re 
re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content)
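
The pattern above returns every href value, so the keyword check still has to be applied to the result. A minimal sketch of that step, assuming `content` holds the page source as a string; the sample HTML and the helper name `find_urls_with_keyword` are made up for illustration:

```python
import re

# Same pattern as above: captures the value of each href attribute.
HREF_RE = re.compile(r'(?i)href=["\']([^\s"\'<>]+)')

def find_urls_with_keyword(content, keyword):
    """Return every href value in content that contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [url for url in HREF_RE.findall(content) if keyword in url.lower()]

# Hypothetical page source, for illustration only.
content = """
<a href="http://example.com/test/page">one</a>
<a HREF='http://example.com/other'>two</a>
<a href="/downloads/TEST.zip">three</a>
"""

print(find_urls_with_keyword(content, "test"))
```

A regex like this also matches href attributes on tags other than `<a>` (for example `<link>`), which may or may not be what you want.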


2015-02-06 07:13:13

Use an HTML parser.

With BeautifulSoup, you can pass a function as the keyword argument value:

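from bs4 import BeautifulSoup

word = "test"
data = "your HTML here"
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(data, "html.parser")

# The function receives each href value (None when the attribute is missing).
for a in soup.find_all('a', href=lambda x: x and word in x):
    print(a['href'])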

Or a regular expression:

import re

# re.escape keeps characters such as "." or "?" in the keyword literal.
for a in soup.find_all('a', href=re.compile(re.escape(word))):
    print(a['href'])

Or a CSS selector:

# *= matches hrefs that contain the word; ^= would only match hrefs that start with it.
for a in soup.select('a[href*="{word}"]'.format(word=word)):
    print(a['href'])
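
If installing BeautifulSoup is not an option, the same keyword filter can be sketched with the standard library's `html.parser`. This is only a rough equivalent; the class name `KeywordLinkParser` and the sample HTML are hypothetical:

```python
from html.parser import HTMLParser

class KeywordLinkParser(HTMLParser):
    """Collect href values of <a> tags whose URL contains a keyword."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.word in value:
                self.links.append(value)

# Hypothetical page source, for illustration only.
data = '<a href="/test/1">x</a> <a href="/other">y</a> <a>no href</a>'

parser = KeywordLinkParser("test")
parser.feed(data)
parser.close()
print(parser.links)
```

Unlike the regex approach, this only looks at `<a>` tags and leaves attribute parsing to the parser.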


2015-02-06 07:27:39 alecxe
