2010-01-25 21 views
3

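Using PHP and regex to scrape labels and data and store them as an associative array

How can I use a regular expression to find this table in a page (I need to find it by name):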

<table id="Table Name"> 
<tr><td class="label">Name:</td> 
<td class="data"><div class="datainfo">Stuff</div></td></tr> 
<tr><td class="label">Email:</td> 
<td class="data"><div class="datainfo">Stuff2</div></td></tr> 
<tr><td class="label">Address:</td> 
<td class="data"><div class="datainfo">Stuff3</div></td></tr> 
</table> 
<table id="Table Name 2"> 
<tr><td class="label">Field1:</td> 
<td class="data"><div class="datainfo">MoreStuff</div></td></tr> 
<tr><td class="label">Field2:</td> 
<td class="data"><div class="datainfo">MoreStuff2</div></td></tr> 
<tr><td class="label">Field3:</td> 
<td class="data"><div class="datainfo">MoreStuff3</div></td></tr> 
</table> 

Then grab each "label" together with its "datainfo" and store them in an associative array, such as:

$table_name['name'] // Stuff
$table_name['email'] // Stuff2
$table_name['address'] // Stuff3

$table_name2['field1'] // MoreStuff
$table_name2['field2'] // MoreStuff2
$table_name2['field3'] // MoreStuff3

Answers

8

Regular expressions are a poor solution in this case. Use Simple HTML Parser instead.

Update: here is the function:

require_once 'simple_html_dom.php'; // assumed file name of the Simple HTML Parser library

$html = str_get_html($html);
print_r(get_table_fields($html, 'Table Name'));
print_r(get_table_fields($html, 'Table Name 2'));

function get_table_fields($html, $id) {
    $result = array();
    $table = $html->find('table[id='.$id.']', 0);
    if (!$table) {
        return $result; // no table with that id
    }
    foreach ($table->find('tr') as $row) {
        $key = trim($row->find('td', 0)->plaintext);
        $value = trim($row->find('td', 1)->plaintext);
        // remove the trailing ':' from the label
        $key = preg_replace('/:$/', '', $key);
        $result[$key] = $value;
    }
    return $result;
}
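The question asks for lowercase keys such as `name` and `email`, while the function above keeps the label text as written, apart from the trailing colon. A minimal sketch of a key normalizer follows; `normalize_label()` is a hypothetical helper for illustration, not part of Simple HTML Parser, and could be applied to `$key` before it is stored:

```php
<?php
// Hypothetical helper, not part of any library: turns a label cell's text
// such as " Email: " into an array key such as "email".
function normalize_label($label) {
    $label = trim($label);                     // drop surrounding whitespace
    $label = preg_replace('/:$/', '', $label); // remove the trailing ':'
    return strtolower($label);                 // lowercase, as in the question
}

echo normalize_label('Name:'), "\n";     // name
echo normalize_label(' Field1: '), "\n"; // field1
```

Whether to lowercase at all is a design choice: keeping the original case, as the answer does, preserves the label exactly as the page shows it.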
+0

Looking at it now, I knew there had to be some kind of solution. Many thanks – mrpatg 2010-01-25 08:26:57

+4

See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for why. – Ikke 2010-01-25 08:26:58

+0

Ah, thank you. You didn't have to write it out for me, but I really appreciate it. :) – mrpatg 2010-01-25 08:37:57

0

I've never played with Simple HTML Parser, but I'm a pretty big fan of PHP's built-in SimpleXML. This accomplishes the same thing.

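// Requires well-formed markup with a single root element
$XML = simplexml_load_string(file_get_contents('test_doc.html'));

$all_labels = $XML->xpath("//td[@class='label']");
$all_datainfo = $XML->xpath("//div[@class='datainfo']");

// Pair each label with its datainfo element
$all = array_combine($all_labels, $all_datainfo);

$final = array();
foreach ($all as $k => $v) {
    // cast to plain strings and strip the trailing ':' from the label
    $final[preg_replace('/:$/', '', (string)$k)] = (string)$v;
}

print_r($final);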

If you're wondering why I have that loop casting everything to (string), do a print_r on $all.

The final result will be:

Array 
(
    [Name] => Stuff 
    [Email] => Stuff2 
    [Address] => Stuff3 
    [Field1] => MoreStuff 
    [Field2] => MoreStuff2 
    [Field3] => MoreStuff3 
) 
+0

Does it work with HTML? – 2010-01-25 12:02:42

+0

I dropped his sample HTML inside a ' ...' so... yes, it does :) – Erik 2010-01-25 14:27:06
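The point in the comments above, that the sample needs a wrapper before SimpleXML will load it, can be sketched as follows. This is only an illustration under stated assumptions: `<root>` is an arbitrary wrapper element chosen here (the wrapper actually used in the comment is not shown), the markup must be well-formed XML, and the results are grouped per table id rather than merged into one array.

```php
<?php
// Sample fragment shaped like the question's: two sibling tables, no single root.
$fragment = <<<HTML
<table id="Table Name">
<tr><td class="label">Name:</td>
<td class="data"><div class="datainfo">Stuff</div></td></tr>
<tr><td class="label">Email:</td>
<td class="data"><div class="datainfo">Stuff2</div></td></tr>
</table>
<table id="Table Name 2">
<tr><td class="label">Field1:</td>
<td class="data"><div class="datainfo">MoreStuff</div></td></tr>
</table>
HTML;

// Assumption: <root> is an arbitrary wrapper chosen for this sketch.
$xml = simplexml_load_string('<root>' . $fragment . '</root>');

$tables = array();
foreach ($xml->xpath('//table') as $table) {
    $fields = array();
    foreach ($table->tr as $row) {
        // Cast to string: SimpleXML hands back SimpleXMLElement objects.
        $label = (string) $row->td[0];
        $value = (string) $row->td[1]->div;
        $fields[preg_replace('/:$/', '', $label)] = $value;
    }
    $tables[(string) $table['id']] = $fields;
}

print_r($tables);
```

Real-world pages are often not valid XML (unclosed tags, unquoted attributes), in which case `simplexml_load_string` returns false and an HTML-aware parser is the safer choice.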

0

I decided to use the PHP DOMDocument class:

<?php 

$dom = new DOMDocument(); 

$dom->loadHTML(file_get_contents('stackoverflow_table.html')); 

$count = 0; 
$data = array(); 

while (++$count) {
    $tableid = 'Table Name' . ($count > 1 ? ' ' . $count : ''); // build the table id
    $table = $dom->getElementById($tableid);
    if (!$table) {
        break; // there is no such table
    }

    $tds = $table->getElementsByTagName('td');
    if ($tds->length) { // did I get any td's?
        for ($i = 0, $l = $tds->length; $i < $l; $i += 2) {
            $keyname = $tds->item($i)->firstChild->nodeValue; // value of the first td
            $value = null;
            // check whether the second td has children (the div); this might always be true because of whitespace
            if ($tds->item($i + 1)->hasChildNodes()) {
                // the div is the second child node, because of the leading whitespace
                $value = $tds->item($i + 1)->childNodes->item(1)->firstChild->nodeValue;
            }
            $data[$keyname] = $value;
        }
    }
}

//should present the format you wanted :) 
var_dump($data); 

Here is the HTML file I created to write this code against:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html> 
<head> 
<meta http-equiv="Expires" content="Fri, Jan 01 1900 00:00:00 GMT"> 
<meta http-equiv="Pragma" content="no-cache"> 
<meta http-equiv="Cache-Control" content="no-cache"> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 
<meta http-equiv="Lang" content="en"> 
<meta name="author" content=""> 
<meta http-equiv="Reply-to" content=""> 
<meta name="generator" content=""> 
<meta name="description" content=""> 
<meta name="keywords" content=""> 
<meta name="creation-date" content="11/11/2008"> 
<meta name="revisit-after" content="15 days"> 
<title>Example</title> 
<link rel="stylesheet" type="text/css" href="my.css"> 
</head> 
<body> 
<table id="Table Name"> 
    <tr> 
     <td class="label">Name:</td> 
     <td class="data"> 
      <div class="datainfo">Stuff</div> 
     </td> 
    </tr> 
    <tr> 
     <td class="label">Email:</td> 
     <td class="data"> 
      <div class="datainfo">Stuff2</div> 
     </td> 
    </tr> 
    <tr> 
     <td class="label">Address:</td> 
     <td class="data"> 
      <div class="datainfo">Stuff3</div> 
     </td> 
    </tr> 
</table> 
<table id="Table Name 2"> 
    <tr> 
     <td class="label">Field1:</td> 
     <td class="data"> 
      <div class="datainfo">MoreStuff</div> 
     </td> 
    </tr> 
    <tr> 
     <td class="label">Field2:</td> 
     <td class="data"> 
      <div class="datainfo">MoreStuff2</div> 
     </td> 
    </tr> 
    <tr> 
     <td class="label">Field3:</td> 
     <td class="data"> 
      <div class="datainfo">MoreStuff3</div> 
     </td> 
    </tr> 
</table> 
</body> 
</html>
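The DOMDocument code above depends on whitespace text nodes (`childNodes->item(1)`), so it fits the indented file but not necessarily the compact markup in the question. A sketch of a variant that selects cells by class with DOMXPath instead; `table_fields_by_id()` is a hypothetical helper named for this example, and it assumes the id contains no double quote because the id is pasted directly into the XPath query:

```php
<?php
// Hypothetical helper for this sketch: returns label => value pairs for
// the table with the given id, or an empty array if no such table exists.
function table_fields_by_id($html, $id) {
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $fields = array();
    // Assumes $id contains no double quote, since it is pasted into the query.
    foreach ($xpath->query('//table[@id="' . $id . '"]//tr') as $row) {
        $label = $xpath->query('.//td[@class="label"]', $row)->item(0);
        $value = $xpath->query('.//div[@class="datainfo"]', $row)->item(0);
        if ($label && $value) {
            $key = preg_replace('/:$/', '', trim($label->textContent));
            $fields[$key] = trim($value->textContent);
        }
    }
    return $fields;
}

$html = '<table id="Table Name 2">'
      . '<tr><td class="label">Field1:</td>'
      . '<td class="data"> <div class="datainfo">MoreStuff</div> </td></tr>'
      . '</table>';

print_r(table_fields_by_id($html, 'Table Name 2'));
```

Because the cells are matched by class rather than by position, the same function should handle both the compact and the indented versions of the table.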