2011-01-26

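Python parsing: extracting data from an HTML text file with a non-standard layout

I need help parsing an HTML text file whose layout I am not sure how to handle, and I could really use some help.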

Code so far:

import urllib, os, urllib2, webbrowser, StringIO, re
from BeautifulSoup import BeautifulSoup
from urllib import urlopen

# Read the saved HTML page
urlfile = open('output.txt', 'r')

html = urlfile

soup = BeautifulSoup(''.join(html))

print soup.prettify()
# Attempt to locate the table and its rows by the span id
table = soup.find('table', id="dgProducts__ctl2_lblCountry")
rows = table.findAll('<span id="dgProducts__ctl2_lblCountry">')

# Print the text of each cell, one "|"-separated line per row
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print

What I am trying to do: I want to extract the data from the HTML text file and output it in the following format:

Header Row: Country Company Name Company Product Name  Status 
Data Row(s): 1  Ace   Desktop  Ace Vision Gold 

Abbreviated data structure of the .html file:

</tr><tr bgcolor="White"> 
    <td><font color="#330099" size="1"> 
     <span><font size="2"> 
      <input id="dgProducts__ctl12_ckCompare" type="checkbox" name="dgProducts:_ctl12:ckCompare" onclick="checkSelected(this.form, this);" /> 
      </font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblModel1"><font size="2"> 
      <a href='ProductDisplay.aspx?return=pm&action=view&search=true&productid=4592&ProductType=1&epeatcountryid=1'>Ace Vision 7HS</a></font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblCountry">United States</span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblProductCategory1"><font size="2">Desktops</font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblRating1"><font size="2">Gold</font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblPoints1">18</span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblEnergyStar">5.0</span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblMonitorType1"><font size="2"></font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblMonitorSize"><font size="2"></font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblListingDate1"><font size="2">3/16/2010</font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblStatus"><font size="2">Active</font></span> 
     </font></td><td><font color="#330099" size="1"> 
     <span id="dgProducts__ctl12_lblExceptions" align="center"><a href='#' onclick=ShowExceptions('Exceptions.aspx?id=4592');>  
      <img src='http://www.epeat.net/Images/inform.gif' title='Click to view exceptions' alt='Click to view exceptions' border='0'></a></span> 
     </font></td> 
Just stumbled over `''.join(html)` instead of `html.read()`. Well, to each his own :) – 2011-01-26 14:19:04
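
To illustrate the comment above: iterating over a file object yields its lines with the newlines kept, so joining them gives the same string as a single `read()` call. A quick sketch in Python 3, using `io.StringIO` as a stand-in for the opened file (the sample text is made up):

```python
import io

# Stand-in for open('output.txt'); a real text file object behaves the same way.
html = io.StringIO("<tr>\n<td>Ace</td>\n</tr>\n")

joined = "".join(html)   # iterating a file yields its lines, newlines included
html.seek(0)             # rewind so the same content can be read again
whole = html.read()      # reads the same text in a single call

print(joined == whole)
```

Both forms work; `read()` just states the intent more directly.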

Answers

I suggest you use the module called MiniDom, i.e. xml.dom.minidom. It parses XML easily, and it can handle HTML as long as the markup is well-formed XML.
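
For what it's worth, the sample row in the question is not well-formed XML (unescaped `&` in the `href`, an unclosed `<img>`, an unquoted `onclick` value), so a strict XML parser will likely reject it. Below is a minimal sketch that swaps in the standard library's lenient `html.parser.HTMLParser` instead, written for Python 3 rather than the Python 2 of the question. The names `ProductSpanParser` and `extract_rows`, the id pattern, and the choice of fields are my own assumptions based on the single row shown, not anything taken from the original page.

```python
import re
from html.parser import HTMLParser

# Assumed id layout, e.g. "dgProducts__ctl12_lblModel1" -> row 12, field "Model".
SPAN_ID = re.compile(r"^dgProducts__ctl(\d+)_lbl([A-Za-z]+)\d*$")


class ProductSpanParser(HTMLParser):
    """Collect the text of every labelled <span>, grouped by grid row number."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = {}          # row number -> {field name: text}
        self._current = None    # (row, field) of the span being read, if any
        self._depth = 0         # how many <span> tags are open inside it

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1    # nested span: keep collecting for the outer one
            return
        match = SPAN_ID.match(dict(attrs).get("id") or "")
        if match:
            row, field = int(match.group(1)), match.group(2)
            self._current = (row, field)
            self._depth = 1
            self.rows.setdefault(row, {})[field] = ""

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            row, field = self._current
            self.rows[row][field] += data


def extract_rows(html_text, fields):
    """Return one list per grid row, holding the requested fields in order."""
    parser = ProductSpanParser()
    parser.feed(html_text)
    parser.close()
    return [
        [parser.rows[row].get(field, "").strip() for field in fields]
        for row in sorted(parser.rows)
    ]


if __name__ == "__main__":
    # Shortened stand-in for the row in the question.
    sample = (
        '<tr><td><span id="dgProducts__ctl12_lblModel1"><font size="2">'
        "<a href='ProductDisplay.aspx?productid=4592'>Ace Vision 7HS</a>"
        "</font></span></td>"
        '<td><span id="dgProducts__ctl12_lblCountry">United States</span></td>'
        '<td><span id="dgProducts__ctl12_lblStatus">'
        '<font size="2">Active</font></span></td></tr>'
    )
    for row in extract_rows(sample, ["Country", "Model", "Status"]):
        print("|".join(row))
```

Each row comes back as a list in the order of the field names you pass, so `"|".join(row)` gives a pipe-separated line like the one the question is after. Fields that are missing or empty in a row come back as empty strings.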