Removing tags from HTML parsed with BeautifulSoup

I'm new to Python and am using BeautifulSoup to parse a website and then extract data from it. I have the following code:

for line in raw_data:  # raw_data is the parsed HTML, split into smaller blocks
    d = {}
    d['name'] = line.find('div', {'class':'torrentname'}).find('a')
    print(d['name'])

This prints:

<a href="/ubuntu-9-10-desktop-i386-t3144211.html"> 
<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>

Normally I would be able to extract 'Ubuntu 9.10 desktop (i386)' by writing

d['name'] = line.find('div', {'class':'torrentname'}).find('a').string

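but because of the strong HTML tag, this returns None. Is there a way to strip out the strong tags and then use .string, or is there a better approach? I tried BeautifulSoup's extract() function, but I couldn't get it to work.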

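Edit: I just realized that my solution doesn't work when there are two sets of strong tags, because the whitespace between the two words is dropped. How can I fix this?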

Asked 2010-08-27 by FlowofSoul

Related: http://stackoverflow.com/questions/598817/python-html-removal/599080#599080 – jfs 2011-01-09 22:06:19

Use the .text attribute:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text
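
To see why .string returns None here while .text does not, here is a minimal sketch. It assumes the bs4 package (BeautifulSoup 4) with its built-in html.parser backend, neither of which the original post specifies. The markup is the anchor from the question, wrapped in the div that the question searches for.

```python
from bs4 import BeautifulSoup

# Anchor markup from the question, wrapped in the div the question looks up.
html = (
    '<div class="torrentname">'
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>'
    '</div>'
)

# Assumption: BeautifulSoup 4 with the standard-library html.parser backend.
soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("div", {"class": "torrentname"}).find("a")

# .string is None when a tag has more than one child. This anchor has two:
# the <strong> tag and the text that follows it.
string_value = anchor.string

# .text concatenates every text node found under the tag.
text_value = anchor.text

print(string_value)
print(text_value)
```

The same lookup chain as in the question is kept on purpose, so the only difference between the two results is the attribute read at the end.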

Or join the results of findAll(text=True):

anchor = line.find('div', {'class':'torrentname'}).find('a') 
d['name'] = ''.join(anchor.findAll(text=True))
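
The join above can be unpacked step by step. This is a sketch under assumptions, not the answerer's own code: it uses BeautifulSoup 4 (the bs4 package), where find_all(string=True) is the current spelling of findAll(text=True), and a hypothetical anchor with two strong tags separated by a space, the case raised in the question's edit. Whether the whitespace-only node survives parsing can depend on the BeautifulSoup version and parser, so the sketch also shows get_text with a separator as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <strong> tags separated by a single space.
html = '<a href="#"><strong>Ubuntu</strong> <strong>Linux</strong></a>'

# Assumption: BeautifulSoup 4 with the standard-library html.parser backend.
anchor = BeautifulSoup(html, "html.parser").find("a")

# find_all(string=True) returns every text node under the tag,
# in document order, with the tags themselves left out.
pieces = [str(node) for node in anchor.find_all(string=True)]

# Joining on the empty string glues the pieces back together and keeps
# whatever whitespace nodes the parser preserved.
joined = "".join(pieces)

# Alternative: strip each piece, skip the empty ones, and join with one space.
# This does not rely on the whitespace-only node surviving parsing.
normalized = anchor.get_text(" ", strip=True)

print(pieces)
print(joined)
print(normalized)
```

Note that the get_text(" ", strip=True) variant puts a space between every pair of text nodes, so it would also split a word whose letters are spread across adjacent tags. The plain join is the safer choice when the parser already keeps the original whitespace.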

Answered 2010-08-29 03:54:02

This doesn't work. It doesn't keep the space in a case like this: Ubuntu Linux. It comes out as UbuntuLinux. – FlowofSoul 2010-08-29 04:24:05

I've updated the answer with an additional option. – 2010-08-29 05:29:17

Thank you so much, that works great! Could you explain how the second line of code works? – FlowofSoul 2010-08-29 15:29:33
