2016-11-09 213 views

Regex - how to correctly grab a nested value

I understand that parsing HTML with regular expressions is not ideal, but I have a use case for it.

I have a coverage report (an HTML page) like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 

<html lang="en"> 

<head> 
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 
    <title>LCOV - .info.cleaned</title> 
    <link rel="stylesheet" type="text/css" href="gcov.css"> 
</head> 

<body> 

    <table width="100%" border=0 cellspacing=0 cellpadding=0> 
    <tr><td class="title">LCOV - code coverage report</td></tr> 
    <tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr> 

    <tr> 
     <td width="100%"> 
     <table cellpadding=1 border=0 width="100%"> 
      <tr> 
      <td width="10%" class="headerItem">Current view:</td> 
      <td width="35%" class="headerValue">top level</td> 
      <td width="5%"></td> 
      <td width="15%"></td> 
      <td width="10%" class="headerCovTableHead">Hit</td> 
      <td width="10%" class="headerCovTableHead">Total</td> 
      <td width="15%" class="headerCovTableHead">Coverage</td> 
      </tr> 
      <tr> 
      <td class="headerItem">Test:</td> 
      <td class="headerValue">.info.cleaned</td> 
      <td></td> 
      <td class="headerItem">Lines:</td> 
      <td class="headerCovTableEntry">399</td> 
      <td class="headerCovTableEntry">1019</td> 
      <td class="headerCovTableEntryLo">39.2 %</td> 
      </tr> 
      <tr> 
      <td class="headerItem">Date:</td> 
      <td class="headerValue">2016-11-07</td> 
      <td></td> 
      <td class="headerItem">Functions:</td> 
      <td class="headerCovTableEntry">22</td> 
      <td class="headerCovTableEntry">67</td> 
      <td class="headerCovTableEntryLo">32.8 %</td> 
      </tr> 
      <tr><td><img src="glass.png" width=3 height=3 alt=""></td></tr> 
     </table> 
     </td> 
    </tr> 

    <tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr> 
    </table> 

    <center> 
    <table width="80%" cellpadding=1 cellspacing=1 border=0> 

    <tr> 
     <td width="50%"><br></td> 
     <td width="10%"></td> 
     <td width="10%"></td> 
     <td width="10%"></td> 
     <td width="10%"></td> 
     <td width="10%"></td> 
    </tr> 

    <tr> 
     <td class="tableHead">Directory <span class="tableHeadSort"><img src="glass.png" width=10 height=14 alt="Sort by name" title="Sort by name" border=0></span></td> 
     <td class="tableHead" colspan=3>Line Coverage <span class="tableHeadSort"><a href="index-sort-l.html"><img src="updown.png" width=10 height=14 alt="Sort by line coverage" title="Sort by line coverage" border=0></a></span></td> 
     <td class="tableHead" colspan=2>Functions <span class="tableHeadSort"><a href="index-sort-f.html"><img src="updown.png" width=10 height=14 alt="Sort by function coverage" title="Sort by function coverage" border=0></a></span></td> 
    </tr> 
    <tr> 
     <td class="coverFile"><a href="src/index.html">src</a></td> 
     <td class="coverBar" align="center"> 
     <table border=0 cellspacing=0 cellpadding=1><tr><td class="coverBarOutline"><img src="ruby.png" width=39 height=10 alt="39.2%"><img src="snow.png" width=61 height=10 alt="39.2%"></td></tr></table> 
     </td> 
     <td class="coverPerLo">39.2&nbsp;%</td> 
     <td class="coverNumLo">399/1019</td> 
     <td class="coverPerLo">32.8&nbsp;%</td> 
     <td class="coverNumLo">22/67</td> 
    </tr> 
    </table> 
    </center> 
    <br> 

    <table width="100%" border=0 cellspacing=0 cellpadding=0> 
    <tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr> 
    <tr><td class="versionInfo">Generated by: <a href="http://ltp.sourceforge.net/coverage/lcov.php">LCOV version 1.10</a></td></tr> 
    </table> 
    <br> 

</body> 
</html> 

I am trying to parse the data out of this line:

<td class="headerCovTableEntryLo">39.2 %</td> 

as 39.2 (a floating-point value).

I currently use this regex, which finds the two matching td elements:

<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td> 

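I seem to misunderstand how groups work. I thought that:

(<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td>)[0-9.].*?\1 

would take what was found in the first group and grab just the numeric value from it, but I get zero matches. Can anyone shed some light on what I am doing wrong?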


Which language/tool are you using?


'Regex - how to correctly grab a nested value?'... Don't use regex, use an HTML parser.


Thank you both... I know an HTML parser would be preferred, and I am in Rails. Unfortunately, that is not easy in the system/environment I am working in. – isuPatches

Answers

2

Is this what you are trying to do? (It captures only the float value.)

<(td) class="headerCovTableEntryLo">([0-9.]+)\s?%<\/\1>

See it working here: https://regex101.com/r/qprROm/2
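Since the asker mentions working in Rails, here is a minimal Ruby sketch of how that pattern could be applied. The sample rows are copied from the report in the question; the variable names (`html`, `pattern`, `coverages`) are illustrative assumptions, not part of the original post.

```ruby
# Sample rows copied from the coverage report in the question.
html = <<~HTML
  <td class="headerCovTableEntry">399</td>
  <td class="headerCovTableEntry">1019</td>
  <td class="headerCovTableEntryLo">39.2 %</td>
  <td class="headerCovTableEntry">22</td>
  <td class="headerCovTableEntry">67</td>
  <td class="headerCovTableEntryLo">32.8 %</td>
HTML

# Group 1 captures the tag name, group 2 captures the number.
# \1 in the closing tag re-matches whatever group 1 captured ("td").
pattern = %r{<(td) class="headerCovTableEntryLo">([0-9.]+)\s?%</\1>}

# String#scan returns one [group1, group2] array per match;
# only the number is kept and converted to a Float.
coverages = html.scan(pattern).map { |_tag, number| number.to_f }

puts coverages.inspect # prints [39.2, 32.8]
```

Only the two `headerCovTableEntryLo` cells match, so the plain hit and total counts are skipped.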

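If so, you had the right idea about \1: a backreference matches again whatever text the referenced group captured. But in your attempt the group captured the whole cell, including the class attribute, and that text never appears a second time (it is certainly not in the closing tag), so \1 cannot match.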
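To make the backreference behaviour concrete, here is a small Ruby sketch that runs the pattern from the question against three inputs. The input strings other than the original cell are made up for illustration.

```ruby
# The asker's attempt: group 1 captures an entire <td>...</td> cell,
# and \1 then demands that exact same text a second time.
attempt = %r{(<td class="headerCovTableEntryLo">[0-9.].*?.%</td>)[0-9.].*?\1}

one_cell   = '<td class="headerCovTableEntryLo">39.2 %</td>'
other_cell = '<td class="headerCovTableEntryLo">32.8 %</td>'

# The cell occurs only once, so \1 has nothing to match.
puts attempt.match?(one_cell)                  # prints false

# Two different cells on separate lines, as in the report: still no match.
puts attempt.match?("#{one_cell}\n#{other_cell}") # prints false

# Hypothetical input: the identical cell repeated after a digit.
# Only here can \1 succeed.
puts attempt.match?("#{one_cell}7#{one_cell}") # prints true
```

The pattern only succeeds when the captured cell is repeated verbatim, which is consistent with the zero matches reported in the question.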

I am not sure this is really what you are trying to do, though. Haha

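Also, doing <(td)>(.*?)<\/\1> does not really make sense in this case, because the tag name is already fixed. A backreference is more useful when the tag can vary, as in <(td|th|tr)>(.*?)<\/\1>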
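A short Ruby sketch of that alternation case, using a made-up input string (not taken from the report), shows why the backreference earns its keep there: it forces the closing tag to agree with the opening one.

```ruby
# Hypothetical input: two well-formed cells and one whose closing tag
# does not agree with its opening tag.
cells = '<th>Hit</th><td>399</td><td>oops</th>'

# Group 1 captures whichever tag name opened the element;
# \1 requires the closing tag to use the same name.
paired = %r{<(td|th|tr)>(.*?)</\1>}

matches = cells.scan(paired)

puts matches.inspect # prints [["th", "Hit"], ["td", "399"]]
```

The `<td>oops</th>` fragment is skipped because `</th>` does not repeat the captured `td`.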

In the end, if I were doing this, I would go for something more flexible, like this: (?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%)

See it working here: https://regex101.com/r/qprROm/3
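As a rough Ruby sketch of the lookaround version (variable names are illustrative assumptions): the lookbehind and lookahead only assert the surrounding context, so the match itself is just the number. Ruby requires the lookbehind to have a fixed length, which this literal string satisfies.

```ruby
# The two percentage cells from the report in the question.
html = <<~HTML
  <td class="headerCovTableEntryLo">39.2 %</td>
  <td class="headerCovTableEntryLo">32.8 %</td>
HTML

# (?<=...) asserts the class attribute comes right before the number,
# (?=...) asserts an optional space and a percent sign follow it.
# Neither assertion consumes text, so only the digits are matched.
lookaround = /(?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%)/

# With a single group, scan yields one-element arrays; flatten them.
values = html.scan(lookaround).flatten.map(&:to_f)

puts values.inspect # prints [39.2, 32.8]
```

Because the tag name is not part of the pattern, this version would also match if the same class appeared on a different element.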


Thanks! (?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%) is exactly what I needed to capture the float value. – isuPatches