2017-02-24 272 views

Python BeautifulSoup: looping through table rows

I'm new to BeautifulSoup and Python, and I'm sure this is a simple question, but I can't seem to solve it.

I want to loop through the rows of an HTML table, grouping the rows by candy type based on the "heading" rows. My table looks like this: [image of the table]

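I want to loop over the data under each candy heading, so the iterations would produce data like this:

First iteration: candy_type: KitKat, location: Mall 1, planned: 63, actual: 0, diff: 25

Second iteration: candy_type: KitKat, location: Mall 2, planned: 7, actual: 0, diff: 6

... Last iteration: candy_type: Skittles, location: Building 2, planned: 320, actual: 236, diff: 0

Here is the table code: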

<TABLE BORDER="1" WIDTH="100%"> 
    <TR> 
     <TH COLSPAN=4>Candy</TH> 
    </TR> 
    <TR BGCOLOR=#CEE3F6> 
     <TD COLSPAN=4> 
     <FONT FACE=Arial> 
      <center><b>KitKat</b></center> 
     </FONT> 
     </TD> 
    </TR> 
    <TR BGCOLOR=#336699> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD> 
    </TR> 
    <TR> 
     <TD>Mall 1</TD> 
     <TD>63</TD> 
     <TD>0</TD> 
     <TD>25</TD> 
    </TR> 
    <TR> 
     <TD>Mall 2</TD> 
     <TD>7</TD> 
     <TD>0</TD> 
     <TD>6</TD> 
    </TR> 
    <TR BGCOLOR=#CEE3F6> 
     <TD COLSPAN=4> 
     <FONT FACE=Arial> 
      <center><b>OH Henry</b></center> 
     </FONT> 
     </TD> 
    </TR> 
    <TR BGCOLOR=#336699> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD> 
    </TR> 
    <TR> 
     <TD>Warehouse 1</TD> 
     <TD>195</TD> 
     <TD>122</TD> 
     <TD>30</TD> 
    </TR> 
    <TR> 
     <TD>Warehouse 2</TD> 
     <TD>96</TD> 
     <TD>76</TD> 
     <TD>6</TD> 
    </TR> 
    <TR BGCOLOR=#CEE3F6> 
     <TD COLSPAN=4> 
     <FONT FACE=Arial> 
      <center><b>Skittles</b></center> 
     </FONT> 
     </TD> 
    </TR> 
    <TR BGCOLOR=#336699> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD> 
    </TR> 
    <TR> 
     <TD>Building 1</TD> 
     <TD>120</TD> 
     <TD>90</TD> 
     <TD>5</TD> 
    </TR> 
    <TR> 
     <TD>Building 2</TD> 
     <TD>320</TD> 
     <TD>236</TD> 
     <TD>0</TD> 
    </TR> 
</TABLE> 

So I tried:

from bs4 import BeautifulSoup

# Parse the saved page with the lxml parser
with open('test.html') as f:
    soup = BeautifulSoup(f, 'lxml')

# Candy heading rows are the <tr> elements with bgcolor #CEE3F6
candy_rows = soup.find_all('tr', {'bgcolor': '#CEE3F6'})
for row in candy_rows:
    print(row)

This prints the three candy types like this:

<tr bgcolor="#CEE3F6"> 
<td colspan="4"> 
<font face="Arial"> 
</font><center><b>KitKat</b></center> 
</td> 
</tr> 
<tr bgcolor="#CEE3F6"> 
<td colspan="4"> 
<font face="Arial"> 
</font><center><b>OH Henry</b></center> 
</td> 
</tr> 
<tr bgcolor="#CEE3F6"> 
<td colspan="4"> 
<font face="Arial"> 
</font><center><b>Skittles</b></center> 
</td> 
</tr> 

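I thought I could group on the candy "heading" rows (i.e. the tr elements with bgcolor set to #CEE3F6) and then iterate based on that, but I can't figure out how to go further and get at the data.

Any ideas?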

Do you have to use `beautifulsoup`? I would recommend using [`parsel`](https://github.com/scrapy/parsel) – eLRuLL

Answer

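Find all the rows, then iterate over them. When you find a row that contains a candy name (identified by the row's colour), keep the name. Then walk through that row's next siblings. Skip the first one, which is the column-header row, and capture the text of the td elements in the rows that follow. When you reach a row holding a different candy's name (again identified by the row's colour), you know you have gone past the last sibling for the current candy.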

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('justTable.htm').read(), 'lxml')
>>> trs = soup.find_all('tr')
>>> for tr in trs:
...     # Candy heading rows are identified by their background colour
...     if tr.get('bgcolor') == '#CEE3F6':
...         candy = tr.text.strip()
...         first = True
...         for sibs in tr.find_next_siblings():
...             if first:
...                 # Skip the LOCATION/PLANNED/ACTUAL/DIFF header row
...                 first = False
...                 continue
...             if sibs.get('bgcolor') == '#CEE3F6':
...                 # Reached the next candy's heading row
...                 break
...             [candy]+sibs.text.strip().split('\n')
... 
['KitKat', 'Mall 1', '63', '0', '25'] 
['KitKat', 'Mall 2', '7', '0', '6'] 
['OH Henry', 'Warehouse 1', '195', '122', '30'] 
['OH Henry', 'Warehouse 2', '96', '76', '6'] 
['Skittles', 'Building 1', '120', '90', '5'] 
['Skittles', 'Building 2', '320', '236', '0']
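
If you want the dictionary-style records described in the question rather than lists, the same idea can be sketched as a single pass over the rows that remembers the most recent candy heading. This is only a sketch under a few assumptions: the function name `parse_candy_rows`, the field names in `FIELDS`, and the trimmed inline `HTML` sample are mine, not from the original post, and it uses the built-in `html.parser` so that lxml is not required. All values are kept as strings.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, inlined so the example runs on its own
HTML = """
<TABLE BORDER="1" WIDTH="100%">
<TR><TH COLSPAN=4>Candy</TH></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
<TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""

CANDY_COLOR = "#CEE3F6"    # heading rows that carry the candy name
HEADER_COLOR = "#336699"   # LOCATION/PLANNED/ACTUAL/DIFF rows
FIELDS = ("location", "planned", "actual", "diff")  # hypothetical key names


def parse_candy_rows(html):
    """Return one dict per data row, tagged with the candy heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    candy = None  # most recent candy heading seen so far
    for tr in soup.find_all("tr"):
        color = tr.get("bgcolor")
        if color == CANDY_COLOR:
            # New group starts: remember its name for the rows that follow
            candy = tr.get_text(strip=True)
            continue
        if color == HEADER_COLOR or candy is None:
            # Column-header rows and rows before the first candy hold no data
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != len(FIELDS):
            continue  # ignore rows that do not have the expected four cells
        record = {"candy_type": candy}
        record.update(zip(FIELDS, cells))
        records.append(record)
    return records


records = parse_candy_rows(HTML)
for record in records:
    print(record)
```

Tracking the current heading in one loop avoids re-walking the siblings for every candy row; convert the numeric fields with `int()` if you need numbers rather than strings.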