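A beginner question about Java regular expressions
I recently started learning Java regular expressions and have come across some very interesting tasks. For example, I now need to extract the "product name", the "product description", and the "sellers of this product" from the following HTML code (apologies for the large chunk of code, but it is very simple):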
<td class="sr-check">
<input type="checkbox" name="cptitle" value="678560038" /></td>
<td class="sr-image" style="width: 80px;"><a href="/Nikon-D300S-12-3-678560038/prices-html" class="strictRule" rel="nofollow"><img src="http://img01.static-nextag.com/image/Nikon-D300S-12-3-MP-Digital-SLR-Camera-Body-Black/0/000/006/789/461/678946110.jpg" alt="Nikon D300S 12.3 MP Digital SLR Camera Body - Black" class="imageLink strictRule" height="75" width="75" id="opILink_0" title="Nikon Digital Cameras - Nikon D300S 12.3 MP Digital SLR Camera Body - Black" /></a><div class="breaker"> </div></td>
<td class="sr-info">
<div class="sr-info">
<a id="opPNLink_0" class="underline" style="font-size:16px" href="/Nikon-D300S-12-3-678560038 /prices-html" >Nikon D300S 12.3 MP <b>Digital</b> SLR <b>Camera</b> Body - Black</a> <div class="sr-subinfo">
<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>
<div class="rating">
<img src="http://img01.static-nextag.com/imagefiles/stars/stars4_10px.gif" alt="4/5 stars" title="4/5 stars" /> (92 user ratings)</div>
<div style="clear: both;">
<!-- nxtginc=nextag.api.ServerInclude$JSPIncludeWriter(/buyer/ATLSSI.jsp?ptid=678560038&dts=y) -->
<a id="_atl_0" style="" href="http://www.nextag.com/serv/main/buyer/MyPDir.jsp?list=_transCookieList&cmd=add&ptitle=678560038" rel="nofollow">+ Add to Shopping List</a> |
<!-- endnxtginc -->
<a rel="nofollow" id="mltLink_0" class="mlt-link" href="/Digital-Cameras--zz500001z2z678560038zB2dgz5---html">See More Like This</a>
</div>
<div id="fsLink_0" class="featuredSeller">
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_0" href="/norob/PtitleSeller.jsp?chnl=main&tag=785646073amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNnaIH00iKSUmBawDRvecwbCpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&ptitle=678560038" target="_blank" >Thundercameras</a>:$1,289
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_1" href="/norob/PtitleSeller.jsp?chnl=main&tag=797076595&ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNrcWLhL%2BhryuAGhXNhYSPE%2BpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&ptitle=678560038" target="_blank" >PhotoVideoSuperStore</a>:$1,269
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_2" href="/norob/PtitleSeller.jsp?chnl=main&tag=803555293&ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNt06qcvLJ5UQz7S3zKd4urWpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&ptitle=678560038" target="_blank" >Digitalelect</a>:$1,279 </div>
Here is what I have in mind:
(1) Extract the product name from the <td class="sr-image"> tag, using this regular expression:
exp ="<td><span\\s+class=\"sr-image\"[^>]*>"
+ ".*?</span><a href=\""
+ "([^\"]+)"
+ "\"[^>]*>"
+ "([^<]+)" + "</a>.*?</td>";
(2) Extract the product description from the <div class="sr-info-description"> tag:
exp = "<div class=\"sr-info-description\"> [^>]*>";
(3) Extract the seller names from the <div id="fsLink_0" class="featuredSeller"> tag:
exp = "<div id=\"fslink_0\" class=\"featuredSeller[^>]*>"
+ ".*?</span><a rel=\""
+ "([^\"]+)"
+ "\"[^>]*>"
+ "([^<]+)" + "</a>.*?</td>";
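I have only just started learning Java regular expressions, so I would be grateful if you could tell me whether I am on the wrong track or whether my regular expressions are wrong. Thanks very much, guys.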
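For reference, here is a minimal sketch of how patterns (1) and (3) might be reworked against the snippet above. It rests on a few assumptions that are not in the original post: the product name is read from the `alt` attribute of the `<img>` inside `<td class="sr-image">` (the markup shown has no `<span>`, and the link text in `sr-info` contains `<b>` tags that `[^<]+` cannot cross), each seller is an `<a class="featuredSeller">` followed by `:$price`, and the class and method names (`ProductScrapeSketch`, `productName`, `sellers`) are made up for illustration. It is tied to this exact markup and is not a general HTML solution.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductScrapeSketch {

    // DOTALL lets ".*?" cross line breaks between the <td> and the <img>.
    // Group 1 is the value of the alt attribute.
    private static final Pattern NAME = Pattern.compile(
            "<td\\s+class=\"sr-image\"[^>]*>.*?<img\\b[^>]*\\balt=\"([^\"]*)\"",
            Pattern.DOTALL);

    // Group 1 is the seller link text, group 2 the price such as $1,289.
    private static final Pattern SELLER = Pattern.compile(
            "<a\\b[^>]*\\bclass=\"featuredSeller\"[^>]*>([^<]+)</a>\\s*:\\s*(\\$[\\d,]+)");

    // Returns the first product name found, or null if nothing matches.
    static String productName(String html) {
        Matcher m = NAME.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Collects every seller/price pair, in document order.
    static Map<String, String> sellers(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = SELLER.matcher(html);
        while (m.find()) {              // find() advances to the next match each call
            result.put(m.group(1).trim(), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Abbreviated version of the markup in the question.
        String html =
                "<td class=\"sr-image\" style=\"width: 80px;\"><a href=\"/p\" rel=\"nofollow\">"
              + "<img src=\"x.jpg\" alt=\"Nikon D300S 12.3 MP Digital SLR Camera Body - Black\" /></a></td>\n"
              + "<div id=\"fsLink_0\" class=\"featuredSeller\">\n"
              + "<a rel=\"nofollow\" class=\"featuredSeller\" href=\"/s?tag=1\" target=\"_blank\" >Thundercameras</a>:$1,289\n"
              + "<a rel=\"nofollow\" class=\"featuredSeller\" href=\"/s?tag=2\" target=\"_blank\" >Digitalelect</a>:$1,279 </div>";

        System.out.println(productName(html));
        System.out.println(sellers(html));
    }
}
```

The seller pattern anchors on each `<a>` rather than on the enclosing `<div>`, so a `while (m.find())` loop can pick up any number of sellers without the pattern having to describe the whole block.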
Try to avoid parsing HTML with regular expressions. – 2010-04-11 21:07:21
You should really consider using an (X)HTML parser for this task rather than regular expressions. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2010-04-11 21:08:22
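Thanks, Phild. I will take your advice. I am just curious about my second query, the one about getting the product description, since the markup there is simple. – Kevin 2010-04-11 21:15:40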
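On the second query, pattern (2) as posted requires a literal space right after the opening tag and has no capturing group, so it would not return the description text from the markup shown. A minimal sketch of one possible adjustment follows, assuming the description is plain text with no nested tags and that `class` is the first attribute of the `<div>`. The names `DescriptionSketch` and `description` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DescriptionSketch {

    // Group 1 captures everything between the opening tag and </div>,
    // on the assumption that the description itself contains no '<'.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<div\\s+class=\"sr-info-description\"[^>]*>([^<]*)</div>");

    // Returns the trimmed description text, or null if the div is not found.
    static String description(String html) {
        Matcher m = DESCRIPTION.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"sr-info-description\">"
                + "SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in."
                + "</div>";
        System.out.println(description(html));
    }
}
```

If the site ever adds inline markup inside the description, `[^<]*` will stop matching, which is one of the reasons the comments above recommend an HTML parser.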