Finding anchor text when there are tags there

I use the following to build the regex I search with:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^<>]*)>([^<]*))</a>''' % our_url

The result looks like this:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?stackoverflow.com([^'"]*)("|')([^<>]*)>([^<]*))</a>'''

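This works great for most links, but it fails on links that have tags inside the anchor text. I tried changing the last part of the regex from:

([^<]*))</a>'''

to: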

(.*))</a>'''

But that just matched everything on the page after the link, which I don't want. Any suggestions on how I can fix this?

Source

2009-03-02 Teifion

Instead of:

[^<>]*

Try:

((?!</a).)*

In other words, match any character that is not the start of a </a sequence.
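A minimal sketch of how this lookahead might slot into the pattern from the question, assuming it replaces the final anchor-text group ([^<]*). The sample HTML, the variable names, the re.escape call and the re.DOTALL flag are assumptions added for illustration. The inner group is written as non-capturing so that the anchor text stays in group 9.

```python
import re

# Hypothetical domain and sample HTML, used only for illustration.
our_url = re.escape("stackoverflow.com")

# Same pattern as in the question, except that the last group uses a
# negative lookahead: consume one character at a time as long as the
# text at that position does not start with "</a".
pattern = re.compile(
    r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^<>]*)>((?:(?!</a).)*))</a>'''
    % our_url,
    re.DOTALL,  # assumption: let "." cross newlines inside anchor text
)

html = (
    '<p><a href="http://stackoverflow.com/questions/603199">'
    'Finding <b>anchor</b> text</a> and more '
    '<a href="http://www.stackoverflow.com/tags">tags</a></p>'
)

# Group 9 is the anchor text, inner tags included.
anchor_texts = [m.group(9) for m in pattern.finditer(html)]
print(anchor_texts)
```

Each match stops at the first closing </a>, so the second link is picked up as a separate match rather than being swallowed by the first.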

Source

2009-03-02 17:37:13 MarkusQ

Thank you very much for the help :) – Teifion 2009-03-02 17:45:18

I wouldn't use a regex for this - use an HTML parser like Beautiful Soup.
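A rough sketch of the parser approach. To keep it dependency-free it uses html.parser from the Python standard library as a stand-in, not Beautiful Soup itself. The class name AnchorTextCollector, the sample HTML and the host check are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorTextCollector(HTMLParser):
    """Collect (href, text) pairs for links that point at one domain."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._href is None:
            href = dict(attrs).get("href") or ""
            # Simplification: only hrefs with an explicit scheme are
            # recognised, unlike the regex, which makes "http://" optional.
            host = urlparse(href).netloc.lower()
            if host in (self.domain, "www." + self.domain):
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives here as well.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks)))
            self._href = None


# Hypothetical sample input.
sample = (
    '<a href="http://stackoverflow.com/q/1">Finding <b>anchor</b> text</a> '
    '<a href="http://example.com/">other</a>'
)
collector = AnchorTextCollector("stackoverflow.com")
collector.feed(sample)
print(collector.links)
```

Unlike the regex, this yields the anchor text with the inner tags stripped; keeping the markup would need extra handling in the start and end tag callbacks.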

Source

2009-03-02 17:32:17

That seems a bit heavyweight for such a simple problem – Teifion 2009-03-02 17:37:09

Never. HTML is very irregular - browsers need to tolerate a great many errors. Beautiful Soup can handle irregular HTML better than regular expressions can. – 2009-03-02 18:04:05

Do a non-greedy search, i.e.

(.*?)
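A small sketch of the difference between the greedy and the non-greedy form, using a simplified pattern and made-up HTML (both are assumptions, not the pattern from the question). In this sketch the lazy group still spans the inner <b> tag, because . is not restricted the way a class such as [^<] is.

```python
import re

# Hypothetical sample with two links; the first has a nested tag.
sample = '<a href="/a">first <b>bold</b></a> filler <a href="/b">second</a> tail'

# Greedy: ".*" runs to the end and backtracks to the LAST "</a>".
greedy = re.search(r'<a[^<>]*>(.*)</a>', sample).group(1)

# Non-greedy: ".*?" grows one character at a time and stops at the
# FIRST "</a>" it can reach.
lazy = re.search(r'<a[^<>]*>(.*?)</a>', sample).group(1)

print(greedy)
print(lazy)
```

Note that . does not match newlines unless re.DOTALL is passed, so anchor text that wraps across lines would need that flag.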

Source

2009-03-02 17:32:35

That only matches up to the tag inside the anchor text – Teifion 2009-03-02 17:35:56

>>> import re 
>>> pattern = re.compile(r'<a\s[^>]*?href=[\'"](.+?)[\'"][^>]*>(.+?)</a>', re.IGNORECASE)
>>> link = '<a href="http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there">Finding anchor text when there are tags there</a>' 
>>> re.match(pattern, link).group(1) 
'http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there' 
>>> re.match(pattern, link).group(2) 
'Finding anchor text when there are tags there'

Source

2009-03-03 00:13:46 riza
