How can I extract text from an HTML tag using a regular expression?

I need to extract the text inside a textarea tag. How can I do this with a regular expression?

<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1"> 
abc_text 
#include<abc> 
xyz 
</textarea>


2015-12-21 vidhan

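You can use an XML parsing library such as 'lxml' to pull data out of XML precisely. It can also be done with a regular expression, but in my opinion that is risky. –

I tried BeautifulSoup, but the textarea contains '<>' and it gives undesired results. – vidhan

'soup.find('textarea').text' –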

You can try:

>>> print([x.strip() for x in re.findall(r'<textarea.*?>(.*?)</textarea>', content, re.DOTALL)])
['abc_text\n #include<abc>\n xyz']
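
As a self-contained sketch of the same idea, the snippet below wraps the pattern in a helper. The name `extract_textareas` and the second textarea in the sample are hypothetical, added only for illustration. The body group is non-greedy so that two textareas on one page are not merged into a single match, and `re.DOTALL` lets `.` cross newlines. It assumes no attribute value contains a `>` character.

```python
import re

# Non-greedy body so each textarea is matched on its own;
# DOTALL lets "." match newlines inside the element.
TEXTAREA_RE = re.compile(r"<textarea\b[^>]*>(.*?)</textarea>",
                         re.DOTALL | re.IGNORECASE)


def extract_textareas(html_text):
    """Return the stripped inner text of every textarea in html_text."""
    return [body.strip() for body in TEXTAREA_RE.findall(html_text)]


content = (
    '<textarea rows="20" cols="70" name="file" id="file">\n'
    "abc_text\n#include<abc>\nxyz\n"
    "</textarea>\n"
    '<textarea name="other">second</textarea>'
)
print(extract_textareas(content))
# ['abc_text\n#include<abc>\nxyz', 'second']
```

Because the pattern never interprets the body as markup, the `<abc>` inside the textarea is returned untouched.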


2015-12-21 10:13:23

Upvoted. If the content of the 'textarea' tag contains newlines, i.e. '\n', will the regex above still work? –

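The input is not valid XML: under XML rules, the opening and closing tags do not match.

#include<abc>

Here <abc> is parsed as an opening tag, not as content.

An XML parsing library will not parse invalid input.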
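
A quick way to see this is to feed the content to a strict XML parser. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml, and the helper name `is_well_formed_xml` is hypothetical. The sample strings are simplified versions of the textarea from the question, with the attributes left out.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# <abc> is read as an opening tag that is never closed.
raw = "<textarea>abc_text\n#include<abc>\nxyz\n</textarea>"
# Same content with the angle brackets written as entities.
escaped = "<textarea>abc_text\n#include&lt;abc&gt;\nxyz\n</textarea>"

print(is_well_formed_xml(raw))      # False
print(is_well_formed_xml(escaped))  # True
```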

Modify the input:

If you change #include<abc> to #include&lt;abc&gt;, then the following works:

>>> import lxml.html as PARSER 
>>> root = PARSER.fromstring(data) 
>>> root.xpath("//textarea/text()") 
['\n abc_text\n #include<abc>\n xyz\n'] 
>>>
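
If you control how the page is generated, the escaping can be done programmatically. The following is a minimal sketch using only the standard library (`html.escape` and `html.parser.HTMLParser`) instead of lxml. The class name `TextareaText` and the shortened attribute list are assumptions made for illustration.

```python
import html
from html.parser import HTMLParser


class TextareaText(HTMLParser):
    """Collect the text found inside textarea elements."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes &lt; and &gt;
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "textarea":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


content = "abc_text\n#include<abc>\nxyz"
# Escape the content before embedding it, so "<abc>" is no longer a tag.
markup = '<textarea name="file">' + html.escape(content) + "</textarea>"

parser = TextareaText()
parser.feed(markup)
parser.close()
extracted = "".join(parser.chunks)
print(extracted == content)  # True
```

The parser decodes the entities again, so the extracted text equals the original content.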

With a regular expression:

>>> data 
'<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">\n abc_text\n</textarea>' 
>>> import re 
>>> re.findall('<textarea[^>]*>[^<]*</textarea>', data) 
['<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">\n abc_text\n</textarea>'] 
>>>


2015-12-21 10:13:35
