2013-03-13

Parsing <tr> tags with BeautifulSoup, trouble extracting values

I have some HTML that looks like this:

<tr> 
    <td>some text</td> 
    <td>some other text</td> 
    <td>some <b>problematic</b> other <br /> text</td> 
</tr> 

And some Python that tries to grab the tags' values and print each inner value:

soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES) 
for row in soup.findAll('tr'): 
    print repr(row) # this prints the whole 'tr' element text just fine. 
    for col in row.contents: 
        print col.string

So the full text of each row prints correctly, but `col.string` prints None for the last element:

some text 
some other text 
None 

I'm not familiar with BeautifulSoup or Python, but it seems that the inner tags in the last element are causing the parsing problem?

Thanks

Answer

You can upgrade to BeautifulSoup version 4 and use .stripped_strings:

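from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
for row in soup.find_all('tr'):
    # stripped_strings yields every text fragment in the row, whitespace-trimmed
    print('\n'.join(row.stripped_strings))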

In BeautifulSoup 3, you need to search for all contained text nodes instead:

for row in soup.findAll('tr'):
    # findAll(text=True) returns every text node; skip the whitespace-only ones
    print('\n'.join(el.strip() for el in row.findAll(text=True) if el.strip()))
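
As for why the last cell printed None: `.string` only returns a value when a tag has exactly one child, and the third <td> contains several (text, <b>, more text, <br />). If you want one value per cell rather than one line per text fragment, something like the following may work. It is only a sketch, assuming Python 3 with BeautifulSoup 4 (the `bs4` package) installed and the built-in `html.parser`; `cell_texts` is a made-up helper name, not part of the library.

```python
from bs4 import BeautifulSoup

html = """
<tr>
    <td>some text</td>
    <td>some other text</td>
    <td>some <b>problematic</b> other <br /> text</td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_texts(row):
    """Return one whitespace-normalized string per <td> in the row.

    Hypothetical helper for illustration only.
    """
    # find_all("td") selects only the cells, so the whitespace-only
    # text nodes that sit between them in row.contents are ignored.
    return [" ".join(td.stripped_strings) for td in row.find_all("td")]


row = soup.find("tr")

# .string is None for any tag with more than one child (the third cell here).
strings = [td.string for td in row.find_all("td")]

# Joining the stripped fragments gives the full text of every cell.
texts = cell_texts(row)
for text in texts:
    print(text)
```

Here `strings[2]` is None while `texts[2]` is "some problematic other text". Iterating over `find_all("td")` instead of `row.contents` also avoids the stray whitespace nodes between the cells.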