2010-04-11 49 views
0

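A beginner question about Java regular expressions

I recently started learning Java regular expressions and have run into some very interesting tasks. For example, I now need to extract the "product name", the "product description", and the "sellers of this product" from the following HTML (sorry for the large chunk of code, but it is quite simple):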

<td class="sr-check"> 
<input type="checkbox" name="cptitle" value="678560038" /></td> 
<td class="sr-image" style="width: 80px;"><a href="/Nikon-D300S-12-3-678560038/prices-html"  class="strictRule" rel="nofollow"><img src="http://img01.static-nextag.com/image/Nikon-D300S-12-3-MP-Digital-SLR-Camera-Body-Black/0/000/006/789/461/678946110.jpg" alt="Nikon D300S 12.3 MP Digital SLR Camera Body - Black" class="imageLink strictRule" height="75" width="75" id="opILink_0" title="Nikon Digital Cameras - Nikon D300S 12.3 MP Digital SLR Camera Body - Black" /></a><div class="breaker">&nbsp;</div></td> 
<td class="sr-info"> 
<div class="sr-info"> 
<a id="opPNLink_0" class="underline" style="font-size:16px" href="/Nikon-D300S-12-3-678560038 /prices-html" >Nikon D300S 12.3 MP <b>Digital</b> SLR <b>Camera</b> Body - Black</a> <div class="sr-subinfo"> 
<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div> 
<div class="rating"> 
<img src="http://img01.static-nextag.com/imagefiles/stars/stars4_10px.gif" alt="4/5 stars" title="4/5 stars" /> (92 user ratings)</div> 
<div style="clear: both;"> 
<!-- nxtginc=nextag.api.ServerInclude$JSPIncludeWriter(/buyer/ATLSSI.jsp?ptid=678560038&dts=y) --> 
<a id="_atl_0" style="" href="http://www.nextag.com/serv/main/buyer/MyPDir.jsp?list=_transCookieList&amp;cmd=add&amp;ptitle=678560038" rel="nofollow">+ Add to Shopping List</a> &nbsp;|&nbsp; 
<!-- endnxtginc --> 
<a rel="nofollow" id="mltLink_0" class="mlt-link" href="/Digital-Cameras--zz500001z2z678560038zB2dgz5---html">See More Like This</a> 
</div> 
<div id="fsLink_0" class="featuredSeller"> 
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_0" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=785646073amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNnaIH00iKSUmBawDRvecwbCpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038" target="_blank" >Thundercameras</a>:$1,289 &nbsp; 
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_1" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=797076595&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNrcWLhL%2BhryuAGhXNhYSPE%2BpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038" target="_blank" >PhotoVideoSuperStore</a>:$1,269 &nbsp; 
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_2" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=803555293&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNt06qcvLJ5UQz7S3zKd4urWpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038" target="_blank" >Digitalelect</a>:$1,279 &nbsp;</div> 

Here is what I have in mind:

(1) Extract the product name from the <td class="sr-image"> tag, using the regular expression

exp ="<td><span\\s+class=\"sr-image\"[^>]*>" 
      + ".*?</span><a href=\"" 
      + "([^\"]+)"  
      + "\"[^>]*>"  
      + "([^<]+)" + "</a>.*?</td>"; 

(2) Extract the product description from the <div class="sr-info-description"> tag.

exp = "<div class="sr-info-description"> [^>]*>" 

(3) Extract the seller names from the <div id="fsLink_0" class="featuredSeller"> tag.

exp = "<div id="fslink_0" class="featuredSeller[^>]*>" 
      + ".*?</span><a rel=\"" 
      + "([^\"]+)"  
      + "\"[^>]*>"  
      + "([^<]+)" + "</a>.*?</td>"; 

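I have only just started learning Java regular expressions, so I would appreciate it if you could tell me whether I am on the wrong track or whether my regular expressions are wrong. Thanks a lot, guys.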

+8

Try to avoid parsing HTML with regular expressions. – 2010-04-11 21:07:21

+5

You should really consider using an (X)HTML parser for this task instead of regular expressions. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2010-04-11 21:08:22

+0

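Thanks, Phild. I will take your advice. I am just curious about my second query, the one about getting the product information, since the markup there is simple. – Kevin 2010-04-11 21:15:40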

Answers

1

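As mentioned above, you should use a parser to interpret HTML input.

But I do want to answer the regular-expression question of extracting the product information from a line of text like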

<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div> 

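Assuming it is all on one line and the text itself does not contain any tags (if it does, you definitely need an HTML parser), the regular expression should look like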

<div class="sr-info-description">([^<]*)</div> 

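Build a Matcher for your expression, call find() on your input, and group(1) will then contain the text inside the div tag (while group(0) contains the whole matched region, including the div tags).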
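The steps above could be sketched roughly as follows. This is only a minimal illustration using the standard java.util.regex classes; the class name DescriptionExtractor and the method extractDescription are hypothetical names chosen for this example, and it relies on the same assumption as before, namely that the description text contains no nested tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class DescriptionExtractor {
    // The expression from the answer: capture everything up to the next '<'.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<div class=\"sr-info-description\">([^<]*)</div>");

    /** Returns the text of the first description div, or null if nothing matches. */
    static String extractDescription(String html) {
        Matcher m = DESCRIPTION.matcher(html);
        // group(1) is the text between the tags; group(0) would be the whole match.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"sr-info-description\">SLR - 13.1MP, 12.3MP"
                + " - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>";
        System.out.println(extractDescription(html));
    }
}
```

If the page contains several such divs, calling find() repeatedly in a while loop on the same Matcher would visit each match in turn. If the description ever contains markup such as <b>, this pattern simply fails to match, which is one more reason to prefer an HTML parser.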