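BeautifulSoup tag removal

I have been working on parsing an HTML table with Python/BeautifulSoup ...

This is my first attempt at coding anything in Python, so it is probably not the most efficient.

I found another post here (most of which works great), but I have run into a problem.

The code I am running is here: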

def strip_tags(html, invalid_tags):
    bs2 = BeautifulSoup(str(html))
    for tag in bs2.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)
    return bs2

invalid_tags = ['td','b']

for row in bs.findAll('tr'):
    col = row.findAll('td')

for index,item in enumerate(col):
    t = item.findAll('a')
    for ta in t:
        ta.replaceWithChildren()
        col[index] == item

for item in col:
    print(strip_tags(item.string,invalid_tags).string)

The original data table (HTML) looks like this:

<td align="left">11/10</td> 
<td>N ARMY</td> 
<td>-7.5</td> 
<td>NL</td> 
<td><b>76-65</b></td> 
<td><span style="color:green">W</span></td> 
<td><span style="color:green">W</span></td> 
<td></td> 
<td class="cell4">50.0%</td> 
<td class="cell4">76.9%</td> 
<td class="cell4">37.5%</td> 
<td class="cell5">37.1%</td> 
<td class="cell5">90.0%</td> 
<td class="cell5">29.4%</td>

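When I run the strip_tags function, it works for every tag except the one on the second line... "None" is returned as the output.

If anyone can offer any insight into why this happens, I would appreciate it.

Edit: Wow, thanks everyone for the quick replies. Anyway, here is what happens when I run the code: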

 
11/10 
None 
-7.5 
NL 
76-65 
W 
W 
None 
50.0% 
76.9% 
37.5% 
37.1% 
90.0% 
29.4%

The problem is with the second line: it returns "None" instead of "N ARMY". So, yes, ideally I just want the text found inside the tags.

2013-04-10 user2267232

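What output are you looking for? BeautifulSoup also has the `.stripped_strings` iterator, which makes most of this unnecessary if all you want is the text in the table. – 2013-04-10 19:27:30

Your indentation looks off; the `for index, item in enumerate(col):` and `for item in col:` blocks should *probably* be indented as part of the preceding `for` loop. – 2013-04-10 19:29:16

You provided the input HTML, but I am confused about what you want to get out of it. Can you post what it *should* return? – 2013-04-10 19:30:13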
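
For readers who want to try the `.stripped_strings` iterator mentioned in the first comment, here is a minimal sketch. It assumes the `bs4` package is installed and uses a shortened copy of the question's row; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (assumption: bs4 is installed).
html = """
<table><tr>
  <td align="left">11/10</td>
  <td>N ARMY</td>
  <td><b>76-65</b></td>
  <td><span style="color:green">W</span></td>
  <td></td>
  <td class="cell4">50.0%</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields every non-blank string in the tree,
# with leading and trailing whitespace removed.
texts = list(soup.stripped_strings)
print(texts)
```

Note that the empty `<td></td>` contributes nothing to the result, so this approach loses the column position of empty cells.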

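If I understand your desired output correctly, you don't need to strip the tags manually - that's why we use BeautifulSoup! ;)

All you need to do is call the get_text() method on the returned tag instances.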
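
As a hedged aside on where the "None" values can come from: in bs4, a tag's `.string` is only set when the tag has exactly one child, so it is None for an empty cell and for a cell that mixes tags with loose text, whereas get_text() always returns a string. The sketch below assumes bs4 is installed, and the second cell is a hypothetical example, not the asker's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: the second mixes a tag with loose text, the third is empty.
html = "<table><tr><td>-7.5</td><td><a href='#'>N</a> ARMY</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string is None unless the tag has exactly one child.
string_values = [td.string for td in cells]

# get_text() concatenates all nested strings and never returns None.
text_values = [td.get_text() for td in cells]

print(string_values)
print(text_values)
```

This matches the empty cell in the question's output, which also printed None.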

Using your sample HTML:

<table> 
    <tr> 
     <td align="left">11/10</td> 
     <td>N ARMY</td> 
     <td>-7.5</td> 
     <td>NL</td> 
     <td><b>76-65</b></td> 
     <td><span style="color:green">W</span></td> 
     <td><span style="color:green">W</span></td> 
     <td></td> 
     <td class="cell4">50.0%</td> 
     <td class="cell4">76.9%</td> 
     <td class="cell4">37.5%</td> 
     <td class="cell5">37.1%</td> 
     <td class="cell5">90.0%</td> 
     <td class="cell5">29.4%</td> 
    </tr> 
</table>

A simple iteration over the tds, calling get_text() on each one, and we're good to go!

from bs4 import BeautifulSoup

# 'test.html' is my local copy of your HTML file
with open('test.html', 'rb') as html:
    soup = BeautifulSoup(html.read(), 'html.parser')

# Print the text content of every table cell
for td in soup.find_all('td'):
    print(td.get_text())

This gives the output:

11/10 
N ARMY 
-7.5 
NL 
76-65 
W 
W 

50.0% 
76.9% 
37.5% 
37.1% 
90.0% 
29.4% 
[Finished in 0.1s]
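
If the cells are wanted row by row, as in the question's loop over `tr`, the same idea can be sketched with the cell loop nested inside the row loop. This is only an illustration: it assumes bs4 is installed, and the second row is made-up data added to show more than one row.

```python
from bs4 import BeautifulSoup

# First row is shortened from the question; the second row is hypothetical.
html = """
<table>
  <tr><td align="left">11/10</td><td>N ARMY</td><td><b>76-65</b></td><td></td></tr>
  <tr><td align="left">11/14</td><td>HYPOTHETICAL U</td><td><b>60-70</b></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.find_all("tr"):
    # The cell loop lives inside the row loop, so every row is processed.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    rows.append(cells)

for cells in rows:
    print(cells)
```

Unlike `.stripped_strings`, this keeps an empty string for each empty cell, so column positions stay aligned.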

2013-04-10 20:19:49

Awesome, this works great. Thank you very much. – user2267232 2013-04-10 21:41:29
