How do I extract certain information from a string in Python?

I want to use Python to extract certain information from HTML code. For example:

<a href="#tips">Visit the Useful Tips Section</a> 
and I would like to get result : Visit the Useful Tips Section 

<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;"> 
<b>Menu</b><br /> 
HTML<br /> 
CSS<br /> 
and I would like to get Menu HTML CSS

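In other words, I want to get everything between <> and <>. I want to write a Python function that takes the HTML code as a string and extracts the information from it. I am stuck at string.split('<').

Source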

2012-06-01 user1401233

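Have you tried using an HTML parsing library? Alternatively, you can process the file by removing all the tags (although using '

import re
string = '<a href="#tips">Visit the Useful Tips Section</a>' 
re.findall(r'<[^>]*>(.*)<[^>]*>', string)  # returns ['Visit the Useful Tips Section']

Source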
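The pattern above works for a single element, but its greedy `(.*)` group can swallow tags when several appear on one line. The sketch below compares it with a non-greedy variant; the names `GREEDY`, `LAZY`, `single` and `multi` are illustrative and not part of the original answer.

```python
import re

# The pattern from the answer: greedy capture between two tags.
GREEDY = re.compile(r'<[^>]*>(.*)<[^>]*>')
# Illustrative variant: the capture stops at the first following tag.
LAZY = re.compile(r'<[^>]*>(.*?)<[^>]*>')

single = '<a href="#tips">Visit the Useful Tips Section</a>'
multi = '<b>Menu</b><br />'

print(GREEDY.findall(single))  # ['Visit the Useful Tips Section']
print(GREEDY.findall(multi))   # ['Menu</b>']  (the closing tag leaks into the match)
print(LAZY.findall(multi))     # ['Menu']
```

Even the non-greedy form only captures text that sits between two tags, so it is best treated as a tool for markup whose shape is known in advance.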

2012-06-01 13:26:25 user278064

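I wouldn't recommend a regex-based solution (http://stackoverflow.com/a/1732454/566644)

@lazyr: It depends on the context... If you know the markup structure well enough and there is no ambiguity, a plain regular expression can do the job with less overhead than a full HTML parser. But you do have to know when a regular expression works fine and when to use an HTML parser.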
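For comparison with the regex approach, here is a minimal parser-based sketch using the standard library's `html.parser` module. The class `TextExtractor` and the helper `extract_text` are hypothetical names introduced for illustration, and joining the text chunks with single spaces is an assumption about the desired output format.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML fragment, separated by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The two snippets from the question.
link = '<a href="#tips">Visit the Useful Tips Section</a>'
menu = ('<div id="menu" style="background-color:#FFD700;height:200px;'
        'width:100px;float:left;">\n<b>Menu</b><br />\nHTML<br />\nCSS<br />')

print(extract_text(link))  # Visit the Useful Tips Section
print(extract_text(menu))  # Menu HTML CSS
```

The parser tokenizes tags itself, so attribute values and unclosed elements such as the `div` above do not need special handling in the calling code.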

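I understand that you are trying to strip the HTML tags and keep only the text.

You can define a regular expression that represents a tag, then replace every match with an empty string.

Example: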

import re

def remove_html_tags(data): 
    # Non-greedy match of anything that looks like <...>
    p = re.compile(r'<.*?>') 
    return p.sub('', data)
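Applied to the two snippets from the question, the function behaves roughly as sketched below. The helper `strip_and_normalize` is a hypothetical addition that collapses the leftover line breaks into single spaces; it is not part of the original answer.

```python
import re

TAG_RE = re.compile(r'<.*?>')


def remove_html_tags(data):
    # Same idea as the answer: delete everything that looks like a tag.
    return TAG_RE.sub('', data)


def strip_and_normalize(data):
    # Hypothetical helper: also collapse runs of whitespace into one space.
    return ' '.join(remove_html_tags(data).split())


link = '<a href="#tips">Visit the Useful Tips Section</a>'
menu = ('<div id="menu" style="background-color:#FFD700;height:200px;'
        'width:100px;float:left;">\n<b>Menu</b><br />\nHTML<br />\nCSS<br />')

print(remove_html_tags(link))     # Visit the Useful Tips Section
print(strip_and_normalize(menu))  # Menu HTML CSS
```

Note that `.` does not match a newline by default, so a tag broken across several lines would be left in place unless the pattern is compiled with `re.DOTALL`.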

References:

Example

Docs about Python regular expressions

Source

2012-06-01 13:29:12 RumburaK

You can use the lxml HTML parser.

>>> import lxml.html as lh 
>>> st = ''' load your above html content into a string ''' 
>>> d = lh.fromstring(st) 
>>> d.text_content() 

'Visit the Useful Tips Section \nand I would like to get result : Visit the Useful Tips Section\n\n\nMenu\nHTML\nCSS\nand I would like to get Menu HTML CSS\n'

Or you can do:

>>> for content in d.text_content().split("\n"): 
...  if content: 
...    print(content) 
... 
Visit the Useful Tips Section 
and I would like to get result : Visit the Useful Tips Section 
Menu 
HTML 
CSS 
and I would like to get Menu HTML CSS 
>>>

Source

2012-06-01 13:32:55 RanRag

I would use BeautifulSoup - it is much more forgiving of malformed HTML.

Source

2012-06-01 13:44:20
