Beginner questions on Java Regular Expression
- by Robert
Hello everyone.
I began studying Java Regular Expression recently and I found some really intersting task.For example,I now need to dig out "Product Name","Product Description" and "Sellers for this product" out of the following HTML code.(I am sorry for the big chunck of code,but it is very straightforward)
<td class="sr-check">
<input type="checkbox" name="cptitle" value="678560038" /></td>
<td class="sr-image" style="width: 80px;"><a href="/Nikon-D300S-12-3-678560038/prices-html" class="strictRule" rel="nofollow"><img src="http://img01.static-nextag.com/image/Nikon-D300S-12-3-MP-Digital-SLR-Camera-Body-Black/0/000/006/789/461/678946110.jpg" alt="Nikon D300S 12.3 MP Digital SLR Camera Body - Black" class="imageLink strictRule" height="75" width="75" id="opILink_0" title="Nikon Digital Cameras - Nikon D300S 12.3 MP Digital SLR Camera Body - Black" /></a><div class="breaker"> </div></td>
<td class="sr-info">
<div class="sr-info">
<a id="opPNLink_0" class="underline" style="font-size:16px" href="/Nikon-D300S-12-3-678560038 /prices-html" >Nikon D300S 12.3 MP <b>Digital</b> SLR <b>Camera</b> Body - Black</a> <div class="sr-subinfo">
<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>
<div class="rating">
<img src="http://img01.static-nextag.com/imagefiles/stars/stars4_10px.gif" alt="4/5 stars" title="4/5 stars" /> (92 user ratings)</div>
<div style="clear: both;">
<!-- nxtginc=nextag.api.ServerInclude$JSPIncludeWriter(/buyer/ATLSSI.jsp?ptid=678560038&dts=y) -->
<a id="_atl_0" style="" href="http://www.nextag.com/serv/main/buyer/MyPDir.jsp?list=_transCookieList&cmd=add&ptitle=678560038" rel="nofollow">+ Add to Shopping List</a> |
<!-- endnxtginc -->
<a rel="nofollow" id="mltLink_0" class="mlt-link" href="/Digital-Cameras--zz500001z2z678560038zB2dgz5---html">See More Like This</a>
</div>
<div id="fsLink_0" class="featuredSeller">
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_0" href="/norob/PtitleSeller.jsp?chnl=main&tag=785646073amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNnaIH00iKSUmBawDRvecwbCpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&ptitle=678560038" target="_blank" >Thundercameras</a>:$1,289
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_1" href="/norob/PtitleSeller.jsp?chnl=main&tag=797076595&ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNrcWLhL%2BhryuAGhXNhYSPE%2BpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&ptitle=678560038" target="_blank" >PhotoVideoSuperStore</a>:$1,269
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_2" href="/norob/PtitleSeller.jsp?chnl=main&tag=803555293&ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNt06qcvLJ5UQz7S3zKd4urWpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&ptitle=678560038" target="_blank" >Digitalelect</a>:$1,279 </div>
I would think of :
(1) digging out the product name from <td class="sr-image >tag,and using regular expression
exp ="<td><span\\s+class=\"sr-image\"[^>]*>"
+ ".*?</span><a href=\""
+ "([^\"]+)"
+ "\"[^>]*>"
+ "([^<]+)" + "</a>.*?</td>";
(2) digging out the product info from the <div class="sr-info-description"> tag.
exp = "<div class="sr-info-description"> [^>]*>"
(3) digging out the Sellers' names from <div id="fsLink_0" class="featuredSeller"> tag.
exp = "<div id="fslink_0" class="featuredSeller[^>]*>"
+ ".*?</span><a rel=\""
+ "([^\"]+)"
+ "\"[^>]*>"
+ "([^<]+)" + "</a>.*?</td>";
I am just beginning learing using Java Regular Expression,I would be grateful if you could correct me if I am in the wrong track or my regular expressiona are wrong.
Thanks a lot,guys.