Regex to remove external links without removing their text


This is a <a href="https://www.test.com">test1</a>. This is <a href="/node/1">test2</a>. This is <a href="https://nct.com">test3</a>. This is a <a href="www.test.com">test4</a>. This is a <a href="http://test.com">test5</a>.

nct.com is my website. I don't want to remove links to it, or the text wrapped in those tags. The same goes for /node/1.

The output I expect is

This is a test1. This is <a href="/node/1">test2</a>. This is <a href="https://nct.com">test3</a>. This is a test4. This is a test5.

For any external site such as test.com, I want to remove the a tag itself but keep the text wrapped in it. The regular expression I am using is

#<a [^>]*\bhref=(['"])http.?://((?<!mywebsite)[^'"])+\1 *.*?</a>#i

This removes the tag together with the text inside it.

Source

2017-10-11 Fazeela Abu Zohra

Do nct.com and /node/1 need to be hardcoded in the regex, or should it only skip URLs without http(s)? – Wouter

I created a regex that does what I think you need:

/<a [^>]*\bhref=(['"])((https?:\/\/|www.)((?!nct\.com).)(.*?))['"]*\b<\/a>/

test
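
A pattern like the one above only locates the external anchors. To keep the link text, the text needs its own capture group that is substituted back in. What follows is a minimal Python sketch under assumptions that are not in the original post: the name strip_external_links is hypothetical, nct.com (with or without www.) is treated as the only internal host, and an anchor counts as external when its href starts with http://, https://, or www. Regexes are fragile on arbitrary HTML, so this is an illustration for markup shaped like the sample, not a general solution.

```python
import re

# Assumption: nct.com is the only internal host; relative URLs such as
# /node/1 never match because they lack the http(s):// or www. prefix.
EXTERNAL_LINK = re.compile(
    r"""
    <a\s[^>]*?\bhref=(['"])        # opening tag up to href; group 1 is the quote
    (?:https?://|www\.)            # absolute URL or a bare www. prefix
    (?!(?:www\.)?nct\.com[/'"])    # do not match our own host
    [^'"]*\1                       # rest of the URL, then the matching quote
    [^>]*>                         # remainder of the opening tag
    (.*?)                          # group 2: the link text
    </a>
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def strip_external_links(html):
    """Replace each external anchor with its inner text (hypothetical helper)."""
    return EXTERNAL_LINK.sub(r"\2", html)


s = ('This is a <a href="https://www.test.com">test1</a>. '
     'This is <a href="/node/1">test2</a>. '
     'This is <a href="https://nct.com">test3</a>. '
     'This is a <a href="www.test.com">test4</a>. '
     'This is a <a href="http://test.com">test5</a>.')
print(strip_external_links(s))
```

Substituting group 2 rather than an empty string is what keeps test1, test4, and test5 in place while the surrounding tags disappear.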

Source

2017-10-11 13:33:37 Wouter

The regex doesn't work for me. I have updated the question; could you please help me solve it? –

@FazeelaAbuZohra I updated the regex (and the test URLs). It's not the cleanest one, but it matches all the invalid URLs in the updated question. – Wouter

You can try this:

import re
s = 'This is a <a href="https://www.test.com">test1</a>. This is <a href="/node/1">test2</a>. This is <a href="https://nct.com">test3</a>. This is a <a href="www.test.com">test4</a>. This is a <a href="http://test.com">test5</a>.'
# Split into sentences at ". " before "This"; unwrap the link unless the sentence mentions nct.com or a node path.
final_list = [re.findall(r"^[a-zA-Z\s]+", i)[0] + re.findall(r'com">(.*?)</a>', i)[0] if "nct.com" not in i and "node" not in i else i for i in re.split(r"\.\s(?=This)", s)]
print(final_list)

Output:

['This is a test1', 'This is <a href="/node/1">test2</a>', 'This is <a href="https://nct.com">test3</a>', 'This is a test4', 'This is a test5']
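
The list comprehension above relies on every sentence starting with "This" and every external URL ending in com. As a rough alternative that does not depend on sentence shape, the decision can be moved into a replacement callback that parses each href. This is only a sketch under stated assumptions: INTERNAL_HOSTS, is_external, and unwrap_external are hypothetical names, and only nct.com and www.nct.com are treated as internal.

```python
import re
from urllib.parse import urlsplit

# Assumption: these are the only hosts that count as "my website".
INTERNAL_HOSTS = {"nct.com", "www.nct.com"}

# Group 2 is the href value, group 3 is the link text.
ANCHOR = re.compile(
    r"""<a\s[^>]*?\bhref=(['"])(.*?)\1[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def is_external(href):
    """Return True when href points at a host outside INTERNAL_HOSTS."""
    # A bare "www.example.com" has no scheme, so prefix "//" to make
    # urlsplit treat it as a host rather than a relative path.
    if href.lower().startswith("www."):
        href = "//" + href
    host = urlsplit(href).hostname  # None for relative URLs like /node/1
    return host is not None and host not in INTERNAL_HOSTS


def unwrap_external(html):
    """Replace external anchors with their text; leave other anchors untouched."""
    def repl(match):
        return match.group(3) if is_external(match.group(2)) else match.group(0)
    return ANCHOR.sub(repl, html)


s = ('This is a <a href="https://www.test.com">test1</a>. '
     'This is <a href="/node/1">test2</a>. '
     'This is <a href="https://nct.com">test3</a>. '
     'This is a <a href="www.test.com">test4</a>. '
     'This is a <a href="http://test.com">test5</a>.')
print(unwrap_external(s))
```

Keeping the host check in ordinary code rather than in a lookahead makes it easier to add more internal hosts later, at the cost of a slightly longer script.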

Source

2017-10-21 21:41:29 Ajax1234
