How to loop over an HTML table dataset in Python

I'm a first-time poster here trying to pick up some Python skills; please be nice to me :-)

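While I'm no stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this is mainly because I lack most, if not all, basic understanding of common "design patterns" (?) and such.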

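Having said that, here is the problem. Part of my current project is to write a simple scraper using Beautiful Soup. The data to be processed has a structure similar to the one laid out below.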

<table> 
    <tr> 
     <td class="date">2011-01-01</td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
    <tr> 
     <td class="date">2011-01-02</td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
</table>

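The main problem is that I simply can't get my head around how to 1) keep track of the current date (tr -> td class="date") while 2) looping through the items in the subsequent tr:s (tr class="item" -> td class="headline" and tr class="item" -> td class="link") and 3) store the processed data in an array.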

Furthermore, all data will be inserted into a database where each entry must contain the following information:

Date
Headline
Link

Note that inserting into the database is not part of the problem; I only mentioned it to better illustrate what I'm trying to accomplish here :-)

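Now, there are many different ways to skin a cat. So, while a solution to the problem at hand is of course very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy used to "attack" this kind of problem :-)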

Last but not least, sorry for such a noobish question.


2011-01-07 Mattias

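The basic problem is that this table is marked up for looks, not for semantic structure. Done properly, each date and its related items would share a parent. Unfortunately they don't, so we have to make do.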

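The basic strategy is to iterate through the rows of the table:

if the first table cell has class "date", we get the date value and update last_seen_date
otherwise, we extract a headline and a link, then save (last_seen_date, headline, link) to the database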

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the original post used the older BeautifulSoup 3 module)

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []
last_seen_date = None
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:  # not a date - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:  # get new date
        last_seen_date = daterow.text
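
The loop above collects tuples in a list rather than writing them out directly. As a minimal sketch of the "save to the database" step, assuming SQLite via the standard-library sqlite3 module (the in-memory database, the table name `entries`, and the column names are all made up for illustration, and `items` is hard-coded here so the snippet runs on its own):

```python
import sqlite3

# Stand-in for the list built by the scraping loop: (date, headline, link).
items = [
    ('2011-01-01', 'Headline', '#'),
    ('2011-01-01', 'Headline', '#'),
    ('2011-01-02', 'Headline', '#'),
    ('2011-01-02', 'Headline', '#'),
]

# Assumption: an in-memory database, used only for demonstration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entries (date TEXT, headline TEXT, link TEXT)')

# Parameterized insert of every tuple in one call.
conn.executemany(
    'INSERT INTO entries (date, headline, link) VALUES (?, ?, ?)',
    items,
)
conn.commit()

# Read the rows back in insertion order to check what was stored.
rows = conn.execute(
    'SELECT date, headline, link FROM entries ORDER BY rowid'
).fetchall()
```

Collecting everything first and inserting once keeps the parsing and the storage separate, which makes each part easier to test.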


2011-01-07 04:11:13

Hi Hugh, I decided to go with your suggestion and it worked out really well. Thanks for your effort! :-) – Mattias 2011-01-08 03:00:20

You can use ElementTree, which is included with Python.

http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()  # Returns the top-level "table" element
print(root.tag)  # "table"
for eachTableRow in root:
    # Iterating over root yields its child <tr> elements
    # (the older root.getchildren() was removed in Python 3.9),
    # so we loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass

This should be enough to get you on your way to parsing this.
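
For illustration, here is one way the two `pass` branches in the skeleton above might be filled in. This is only a sketch under a few assumptions: the markup is well-formed XML, the sample table is inlined as a string instead of being read from page.xhtml, and the names `XHTML` and `parse_rows` are invented for this example.

```python
import xml.etree.ElementTree as ET

# The sample table from the question, inlined so the sketch is self-contained.
XHTML = """
<table>
    <tr><td class="date">2011-01-01</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr><td class="date">2011-01-02</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>
"""


def parse_rows(markup):
    """Return (date, headline, link) tuples, carrying the last seen date forward."""
    root = ET.fromstring(markup.strip())
    items = []
    last_seen_date = None
    for row in root:  # direct children of <table>, i.e. the <tr> elements
        if row.get('class') == 'item':
            # Item row: pair it with the most recent date row.
            headline = row.find("td[@class='headline']").text
            link = row.find("td[@class='link']/a").get('href')
            items.append((last_seen_date, headline, link))
        else:
            # Date row: remember the date for the item rows that follow.
            last_seen_date = row.find("td[@class='date']").text
    return items


items = parse_rows(XHTML)
```

Note that ElementTree only handles well-formed XML, so real-world HTML that is not valid XHTML would still need a more forgiving parser such as Beautiful Soup.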


2011-01-07 04:06:55 user407896

Hi, thanks for your input. I'm currently using Beautiful Soup for scraping purposes, but I will most likely look into ElementTree soon. Cheers! :-) – Mattias 2011-01-08 03:11:12
