2013-01-05 42 views
11

Good evening. I have used BeautifulSoup to extract some data from a website, as follows:

from BeautifulSoup import BeautifulSoup 
from urllib2 import urlopen 

soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002')) 

table = soup.findAll('table', attrs={ "class" : "table-horizontal-line"}) 

print table 

This gives the following output:

[<table width="70%" class="table-horizontal-line"> 
<tr> 
<th>Amount</th> 
<th>Company or person fined</th> 
<th>Date</th> 
<th>What was the fine for?</th> 
<th>Compensation</th> 
</tr> 
<tr> 
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td> 
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td> 
<td>19/12/02</td> 
<td>Attempting to mislead the Japanese regulatory and tax authorities</td> 
<td>&#160;</td> 
</tr> 
<tr> 
<td>£750,000</td> 
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td> 
<td>17/12/02</td> 
<td>Breaches of money laundering rules</td> 
<td>&#160;</td> 
</tr> 
<tr> 
<td>£1,000,000</td> 
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td> 
<td>04/12/02</td> 
<td>Mortgage endowment mis-selling and other failings</td> 
<td>Compensation estimated to be between £120 and £160 million</td> 
</tr> 
<tr> 
<td>£1,350,000</td> 
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal &#38; Sun Alliance Group</a></td> 
<td>27/08/02</td> 
<td>Pension review failings</td> 
<td>Redress exceeding £32 million</td> 
</tr> 
<tr> 
<td>£4,000</td> 
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment &#38; Insurance Consultants</a></td> 
<td>07/08/02</td> 
<td>Pensions review failings</td> 
<td>&#160;</td> 
</tr> 
<tr> 
<td>£75,000</td> 
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td> 
<td>18/06/02</td> 
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td> 
<td>&#160;</td> 
</tr> 
<tr> 
<td>£120,000</td> 
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td> 
<td>14/05/02</td> 
<td>Pension review failings</td> 
<td>&#160;</td> 
</tr> 
<tr> 
<td>£140,000</td> 
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life &#38; Financial Planning ltd)</td> 
<td>11/04/02</td> 
<td>Record keeping and associated compliance breaches</td> 
<td>&#160;</td> 
</tr> 
<tr> 
<td>£5,000</td> 
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td> 
<td>04/04/02</td> 
<td>Pensions review failings</td> 
<td>&#160;</td> 
</tr> 
</table>] 

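I would like to export this to CSV while keeping the table structure as it is displayed on the website. Is this possible, and if so, how?

Thanks in advance for your help.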

+1

You might want to look at this solution: http://sebsauvage.net/python/html2csv.py. Found it by googling "html to csv python" :) – Infinity

+0

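Thanks, although that solution looks rather complicated? I was hoping there would be a simpler way, given that I already have all the data in a relatively clean format... If not, I will try to follow this :-) –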

Answers

23

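Here is a basic approach you can try. It assumes that the headers are all in <th> tags and that all subsequent data is in <td> tags. That works for the single case you provided, but I'm sure adjustments will be needed in other cases :) The general idea is that once you have found your table (here using find to pull the first one), we get the headers by iterating through all the th elements and storing them in a list. Then we create a rows list that will contain lists representing the contents of each row. It is populated by finding all the td elements under each tr tag and taking their text, encoded as UTF-8 (from Unicode). You then open a CSV file, write the headers first, and then write all of the rows, but using (row for row in rows if row) to eliminate any blank rows: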

In [117]: import csv 

In [118]: from bs4 import BeautifulSoup 

In [119]: from urllib2 import urlopen 

In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002')) 

In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"}) 

In [122]: headers = [header.text for header in table.find_all('th')] 

In [123]: rows = [] 

In [124]: for row in table.find_all('tr'): 
    .....:  rows.append([val.text.encode('utf8') for val in row.find_all('td')]) 
    .....: 

In [125]: with open('output_file.csv', 'wb') as f: 
    .....:  writer = csv.writer(f) 
    .....:  writer.writerow(headers) 
    .....:  writer.writerows(row for row in rows if row) 
    .....: 

In [126]: cat output_file.csv 
Amount,Company or person fined,Date,What was the fine for?,Compensation 
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities, 
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules, 
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million 
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million 
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings, 
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")", 
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings, 
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches, 
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings, 
+0

Thanks, this looks like a perfect solution. However, I seem to get a SyntaxError on the 'cat output_file.csv' line; it just says invalid syntax? –

+1

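@merlin_1980 Sorry, I should have mentioned that this is an IPython-specific thing (basically just trying to display the contents of the file). If you got to that point, you should have the file saved in that directory. – RocketDonkey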
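
Outside IPython, the `cat output_file.csv` line has no direct equivalent in the plain Python prompt, which is why it raises a SyntaxError there. A minimal sketch of showing the file's contents from ordinary Python follows. The small file written at the top is a hypothetical stand-in so the snippet runs on its own; with a real output_file.csv only the second `with` block would be needed.

```python
import csv
import os
import tempfile

# Hypothetical file, written only so that this sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "output_file.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["Amount", "Date"], ["£750,000", "17/12/02"]])

# Plain-Python stand-in for IPython's `cat output_file.csv`:
# read the whole file as text and print it.
with open(path, newline="", encoding="utf-8") as f:
    contents = f.read()
print(contents)
```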

+0

Thanks very much :-) I hadn't thought of looking in the directory and opening the file manually! –
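
For anyone on Python 3, where urllib2 no longer exists and the csv module expects text rather than bytes, the answer's approach might look roughly like the sketch below. This is not the answerer's code: SAMPLE_HTML is a made-up, shortened copy of the table from the question so that the example runs without a network call, and table_to_rows and write_csv are hypothetical helper names. It also strips surrounding whitespace from each cell (including the non-breaking space in front of the first amount), which the original session does not do.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<table width="70%" class="table-horizontal-line">
<tr><th>Amount</th><th>Company or person fined</th><th>Date</th></tr>
<tr><td>&#160;£4,000,000</td><td><a href="#">Credit Suisse First Boston International </a></td><td>19/12/02</td></tr>
<tr><td>£750,000</td><td><a href="#">Royal Bank of Scotland plc</a></td><td>17/12/02</td></tr>
</table>
"""


def table_to_rows(html, css_class="table-horizontal-line"):
    """Return (headers, rows) for the first table with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    headers = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


def write_csv(headers, rows, fileobj):
    """Write the header row followed by the data rows to a text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(headers)
    writer.writerows(rows)


headers, rows = table_to_rows(SAMPLE_HTML)

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_csv(headers, rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, something like open('output_file.csv', 'w', newline='', encoding='utf-8') could be passed as fileobj. To work from a live page, the HTML could be fetched with urllib.request.urlopen, the Python 3 counterpart of urllib2.urlopen.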