Python - Regex match URL in web page source code

I use this pattern to match every URL in a web page:

import re 

source = """ 
<p>https://example.com</p> 
... some code 
<font color="E80000">https://example.com</font></a> 
""" 

urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', source)

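This has worked well for me until now. I found that sometimes it does not match the exact URL. In this example it returns https://example.com</p> and https://example.com</font></a>, with the closing tags included, but I cannot figure out what is wrong with the regular expression. I took this code from another Stack Overflow question.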


2017-02-09 Hyperion

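You are using a hyphen between two symbols inside a character class, '[$-_]', which creates a range. That range matches '<' and '>', all ASCII digits, uppercase letters and more. Replace '[$-_@.&+]' with '[-$_@.&+]'.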
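To make the comment above concrete, here is a minimal sketch comparing the two character classes on a shortened copy of the sample HTML. The variable names (`buggy`, `fixed`, `buggy_urls`, `fixed_urls`) are only illustrative and not from the original post.

```python
import re

source = """
<p>https://example.com</p>
<font color="E80000">https://example.com</font></a>
"""

# "$-_" is a range from "$" (0x24) to "_" (0x5F). It covers "<", ">", "/",
# the digits and the uppercase letters, so closing tags get swallowed.
buggy = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# A leading "-" is a literal hyphen, so no range is created.
fixed = r'http[s]?://(?:[a-zA-Z]|[0-9]|[-$_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

buggy_urls = re.findall(buggy, source)
fixed_urls = re.findall(fixed, source)

print(buggy_urls)  # ['https://example.com</p>', 'https://example.com</font></a>']
print(fixed_urls)  # ['https://example.com', 'https://example.com']
```

One side effect worth knowing: '/' was only matched because it sits inside the accidental range, so the fixed class stops at the first '/' after the host. If you need paths, adding '/' to the class explicitly (for example '[-$_@.&+/]') is one option.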

See this link: http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link

You can also check this: http://stackoverflow.com/questions/6883049

Try this:

import re 

source = """ 
<p>https://example.com</p> 
... some code 
<font color="E80000">https://example.com</font> 
https://example.com</p></a> 
https://example.com</font></a> 
""" 
# The pattern has three capturing groups, so findall returns (scheme, host, rest) tuples.
urls = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', source)
print(urls)
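Because this pattern contains capturing groups, `re.findall` gives back tuples rather than whole URLs. The sketch below shows one way to get the full matches with `re.finditer`. It assumes the final character class reads `[\w@?^=%&/~+#-]`, and the `/a?b=1` path is a made-up example, not part of the original sample.

```python
import re

source = """
<p>https://example.com</p>
<font color="E80000">https://example.com/a?b=1</font></a>
"""

pattern = re.compile(
    r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))'
    r'([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

# findall returns one (scheme, host, rest) tuple per match.
groups = pattern.findall(source)

# group(0) is the whole match, so this yields complete URLs.
full_urls = [m.group(0) for m in pattern.finditer(source)]

print(groups)
print(full_urls)  # ['https://example.com', 'https://example.com/a?b=1']
```

Making all three groups non-capturing with `(?:...)` would let `findall` return the full URLs directly, if you prefer to keep a single call.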


2017-02-09 09:24:02 Arun
