How to print only certain text using BeautifulSoup

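I am trying to extract some financial data for a city government using BeautifulSoup (I had to convert the file from PDF). I just want to get the data as a CSV file, which I will then analyze in Excel or SAS. My problem is that I don't want to print the "&nbsp;" that is in the original HTML, only the numbers and the row headings. Any suggestions on how to do this without using regular expressions?

Below is a sample of the HTML I am looking at, followed by my code (currently just a proof of concept; I need to show that I can get clean data before going further). I am new to Python and programming, so any help is appreciated.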

<TD class="td1629">Investments (Note 2)</TD> 

<TD class="td1605">&nbsp;</TD> 

<TD class="td479">&nbsp;</TD> 

<TD class="td1639">-</TD> 

<TD class="td386">&nbsp;</TD> 

<TD class="td116">&nbsp;</TD> 

<TD class="td1634">2,207,592</TD> 

<TD class="td479">&nbsp;</TD> 

<TD class="td1605">&nbsp;</TD> 

<TD class="td1580">2,207,592</TD> 

<TD class="td301">&nbsp;</TD> 

<TD class="td388">&nbsp;</TD> 

<TD class="td1637">2,882,018</TD>

CODE

import htmllib 
import urllib 
import urllib2 
import re 
from BeautifulSoup import BeautifulSoup 

CAFR = open("C:/Users/snown/Documents/CAFR2004 BFS Statement of Net Assets.html", "r") 

soup = BeautifulSoup(CAFR) 

assets_table = soup.find(True, id="page_27").find(True, id="id_1").find('table') 

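rows = assets_table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        # take the first text node of the cell and print it followed by a pipe
        text = ''.join(td.find(text=True))
        print text+"|",
    # end the output line after each table row
    print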

Source

2011-11-02 snown

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

It converts &nbsp; and other HTML entities to the appropriate characters.

To write the result to a CSV file:
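As a rough illustration of what entity conversion does, here is a minimal sketch for Python 3 that uses the standard library's `html.unescape` as a stand-in for BeautifulSoup 3's `convertEntities` option. The `raw_cells` list is a hypothetical sample modeled on the table cells in the question, not output from the real file.

```python
import html

# Hypothetical cell contents, modeled on the <TD> sample in the question.
raw_cells = ["Investments (Note 2)", "&nbsp;", "-", "2,207,592"]

# html.unescape turns named entities such as &nbsp; into the characters
# they stand for; text without entities is returned unchanged.
decoded = [html.unescape(cell) for cell in raw_cells]

# The second cell is now a single U+00A0 (no-break space) character.
print(repr(decoded[1]))
```

After decoding, the "empty" cells hold a real whitespace character instead of the literal text `&nbsp;`, which makes them easy to detect and drop later.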

>>> import csv 
>>> import sys 
>>> csv_file = sys.stdout 
>>> writer = csv.writer(csv_file, delimiter="|") 
>>> soup = BeautifulSoup("<tr><td>1<td>&nbsp;<td>3", 
...      convertEntities=BeautifulSoup.HTML_ENTITIES) 
>>> writer.writerows([''.join(t.encode('utf-8') for t in td(text=True)) 
...     for td in tr('td')] for tr in soup('tr')) 
1| |3

I used t.encode('utf-8') because &nbsp; is translated to the non-ASCII character U+00A0 (no-break space).

Source
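The encoding point can be checked directly. The sketch below is written for Python 3 (an assumption; the answer above targets Python 2) and shows that U+00A0 has no ASCII representation but encodes to two bytes in UTF-8.

```python
nbsp = "\u00a0"  # the character that &nbsp; decodes to

# In UTF-8 the no-break space is the two-byte sequence 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Encoding to ASCII fails, because U+00A0 is outside the ASCII range.
try:
    nbsp.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, ascii_ok)
```

On Python 3 the `csv` module writes text rather than bytes, so the explicit `encode` call is typically unnecessary there, provided the output file is opened with a suitable encoding.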

2011-11-02 05:26:22 jfs

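Very cool, thanks @Sebastian. Going one step further, is there a way to write the output so that it is 1|3 instead of 1| |3? – snown

@snown: just don't add columns that contain only whitespace. Check whether the string has any non-whitespace characters: 'if column.strip()'. On Unicode strings, the 'strip()' method treats the no-break space as whitespace and removes it from the string. – jfs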
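The filtering suggested in the comment above could look roughly like this on Python 3. This is a sketch under assumptions: `drop_blank_cells` is a hypothetical helper name, the `row` data is made up to resemble the question's table after entity conversion, and an in-memory buffer stands in for a real output file.

```python
import csv
import io

def drop_blank_cells(cells):
    """Keep only cells that contain at least one non-whitespace character."""
    # str.strip() treats U+00A0 (no-break space) as whitespace, so a cell
    # holding only a decoded &nbsp; becomes the empty string and is skipped.
    return [cell.strip() for cell in cells if cell.strip()]

# Hypothetical row after entity conversion: blanks are no-break spaces.
row = ["Investments (Note 2)", "\u00a0", "\u00a0", "-", "\u00a0", "2,207,592"]

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.writer(buffer, delimiter="|")
writer.writerow(drop_blank_cells(row))
output = buffer.getvalue()
print(output)
```

One trade-off to keep in mind: dropping blank cells also drops their positions, so rows with different numbers of empty cells will no longer line up column by column.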
