2014-02-25 32 views

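Parsing specific data from HTML

I have already extracted the second table; from that table I need to extract only the rows that have a file name in column[0].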

<TABLE WIDTH="100%" BORDER="1" > 
<TR ><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="2" WIDTH="70%">Root</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;10.1% (1077/10647)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions and exits</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.5% (2142/22473)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Statement blocks</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.1% (2191/24167)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Decisions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.8% (2648/29930)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Loops</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.4% (305/3628)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Basic conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.3% (1759/21254)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Modified conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;1.8% (35/1997)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Multiple conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;4.4% (137/3082)</TD></TR> 

</TABLE> 
</P> 
<P ALIGN="LEFT"><BR> 
2 - Files list</P> 
<BR> 
Display absolute values only.<BR> 

<TABLE WIDTH="100%" BORDER="1" > 
<TR BGCOLOR="#FFFF99"><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><b>Item<IMG SRC="cvi_sort_d.png" ALT="cvi_sort_d.xpm"></b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions and exits</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Statement blocks</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Decisions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Loops</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Basic conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Modified conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Multiple conditions</b></TD></TR> 
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="LOADER.H.html">LOADER.H</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="CORBA_FIXED.CC.html">CORBA_FIXED.CC</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
</TABLE> 

To parse this I wrote the following Python script:

from bs4 import BeautifulSoup
f = open("/home/vignesh/Downloads/html/RateDoc.html","r")
fl = {'LOADER.H','CORBA_FIXED.H'}
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t[1:]:
    rows = table.findAll('tr')
    for tr in rows[1:]:
        cols = tr.findAll('td')
        for td in cols:
            text = ''.join((td.find(text=True)).encode('utf-8'))
            print text+"\t",
        print
    print


The above script extracts the data as follows:


LOADER.H 0/1 0/2 0/1 0/1 none none none none  
    none none none none none none none none  
    none none none none none none none none  
        none none none none none none none none  
    none none none none none none none none  
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none  
    none none none none none none none none  
    none none none none none none none none  
    none none none none none none none none  
    none none none none none none none none 

But that is not the expected result: I want to extract only the rows for files with the extensions *.cc and *.h.

Required output:

LOADER.H 0/1 0/2 0/1 0/1 none none none none  
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none  

Can anyone help me modify the above script so that it extracts only the rows with the specific extensions *.cc and *.h?

Answers

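from bs4 import BeautifulSoup

INPUT = "/home/vignesh/Downloads/html/RateDoc.html"

def main():
    with open(INPUT, "rb") as inf:
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(inf, "html.parser")

    for row in soup.find_all("tr"):
        first_col = row.find("td")
        if first_col is None:
            continue
        links = first_col.find_all("a")
        # File rows and struct rows both hold two anchors: <A NAME=...> and <a href=...>
        if len(links) == 2:
            link_text = links[1].text
            parts = link_text.rsplit(".", 1)
            # Keep only names ending in .h or .cc (case-insensitive)
            if len(parts) > 1 and parts[-1].lower() in {"h", "cc"}:
                print("\t".join(cell.text.strip() for cell in row.find_all("td")))

if __name__ == "__main__":
    main()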

which produces

LOADER.H 0/1 0/2 0/1 0/1 none none none none 
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none 
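
The extension test in the answer above can also be pulled out into a small helper. This is only a minimal sketch: the name has_wanted_extension and the WANTED set are made up for illustration, and it swaps the rsplit check for os.path.splitext from the standard library.

```python
import os.path

# Hypothetical names; the set mirrors the *.h / *.cc request in the question.
WANTED = {".h", ".cc"}

def has_wanted_extension(name, wanted=WANTED):
    """Return True if name ends in one of the wanted extensions (case-insensitive)."""
    # splitext("LOADER.H") gives ("LOADER", ".H"); a name without a dot gives ""
    ext = os.path.splitext(name.strip())[1]
    return ext.lower() in wanted

names = ["LOADER.H", "LoaderState_struct", "CORBA_FIXED.CC"]
print([n for n in names if has_wanted_extension(n)])
```

Keeping the check in one function makes it easy to add further extensions later without touching the parsing loop.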

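It looks as though it should work if you wrap your print in an if. This is based on the fact that the lines you want to skip seem to print an initial blank entry followed by 8 values of "none".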

# Compare by value with ==; "is" tests object identity and is unreliable for strings
if text == '':
    break
else:
    print text + '\t',

This comes only from reading through your code, as I am currently unable to test it.
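
One way this idea might be completed, sketched for Python 3 under a few assumptions: the rows are already available as lists of cell strings (the sample data and the function name named_rows are made up for illustration), and the unwanted rows start with a cell that is blank once whitespace, including the &#160; non-breaking spaces seen in the HTML, has been stripped. It skips the whole row instead of using break inside the cell loop, because in the original script the print after the cell loop would still emit an empty line for every skipped row.

```python
def named_rows(rows):
    """Keep only rows whose first cell is not blank (hypothetical helper)."""
    kept = []
    for cells in rows:
        # In Python 3, str.strip() also removes non-breaking spaces (\xa0)
        if not cells or cells[0].strip() == "":
            continue  # skip the whole row, not just its first cell
        kept.append([c.strip() for c in cells])
    return kept

# Illustrative data shaped like the first text node of each cell
sample = [
    ["LOADER.H", "0/1", "0/2", "0/1", "0/1", "none ", "none ", "none ", "none "],
    [" \xa0\xa0\xa0", "none ", "none ", "none ", "none ", "none ", "none ", "none ", "none "],
    ["CORBA_FIXED.CC", "0/1", "0/2", "0/1", "0/1", "none ", "none ", "none ", "none "],
]

for row in named_rows(sample):
    print("\t".join(row))
```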