Beginner questions on Java Regular Expression

Posted by Robert on Stack Overflow See other posts from Stack Overflow or by Robert
Published on 2010-04-11T21:01:52Z Indexed on 2010/04/11 21:13 UTC
Read the original article Hit count: 345

Filed under:
|
|

Hello everyone. I began studying Java Regular Expression recently and I found some really intersting task.For example,I now need to dig out "Product Name","Product Description" and "Sellers for this product" out of the following HTML code.(I am sorry for the big chunck of code,but it is very straightforward)

<td class="sr-check">
<input type="checkbox" name="cptitle" value="678560038" /></td>
<td class="sr-image" style="width: 80px;"><a href="/Nikon-D300S-12-3-678560038/prices-html"     class="strictRule" rel="nofollow"><img src="http://img01.static-nextag.com/image/Nikon-D300S-12-3-MP-Digital-SLR-Camera-Body-Black/0/000/006/789/461/678946110.jpg" alt="Nikon D300S 12.3 MP Digital SLR Camera Body - Black" class="imageLink strictRule" height="75" width="75" id="opILink_0" title="Nikon Digital Cameras - Nikon D300S 12.3 MP Digital SLR Camera Body - Black" /></a><div class="breaker">&nbsp;</div></td>
<td class="sr-info">
<div class="sr-info">
<a id="opPNLink_0" class="underline" style="font-size:16px" href="/Nikon-D300S-12-3-678560038  /prices-html" >Nikon D300S 12.3 MP <b>Digital</b> SLR <b>Camera</b> Body - Black</a> <div class="sr-subinfo">
<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>
<div class="rating">
<img src="http://img01.static-nextag.com/imagefiles/stars/stars4_10px.gif" alt="4/5 stars" title="4/5 stars" /> (92 user ratings)</div>
<div style="clear: both;">
<!-- nxtginc=nextag.api.ServerInclude$JSPIncludeWriter(/buyer/ATLSSI.jsp?ptid=678560038&dts=y) -->
<a id="_atl_0" style="" href="http://www.nextag.com/serv/main/buyer/MyPDir.jsp?list=_transCookieList&amp;cmd=add&amp;ptitle=678560038" rel="nofollow">+ Add to Shopping List</a> &nbsp;|&nbsp; 
<!-- endnxtginc -->
<a rel="nofollow" id="mltLink_0" class="mlt-link" href="/Digital-Cameras--zz500001z2z678560038zB2dgz5---html">See More Like This</a>
</div>
<div id="fsLink_0" class="featuredSeller">
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_0" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=785646073amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNnaIH00iKSUmBawDRvecwbCpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >Thundercameras</a>:$1,289 &nbsp;
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_1" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=797076595&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNrcWLhL%2BhryuAGhXNhYSPE%2BpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >PhotoVideoSuperStore</a>:$1,269 &nbsp;
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_2" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=803555293&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNt06qcvLJ5UQz7S3zKd4urWpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >Digitalelect</a>:$1,279 &nbsp;</div>

I would think of :

(1) digging out the product name from <td class="sr-image >tag,and using regular expression

exp ="<td><span\\s+class=\"sr-image\"[^>]*>"
          + ".*?</span><a href=\""
          + "([^\"]+)"      
          + "\"[^>]*>"      
          + "([^<]+)" + "</a>.*?</td>";

(2) digging out the product info from the <div class="sr-info-description"> tag.

exp = "<div class="sr-info-description"> [^>]*>"

(3) digging out the Sellers' names from <div id="fsLink_0" class="featuredSeller"> tag.

exp = "<div id="fslink_0" class="featuredSeller[^>]*>"
          + ".*?</span><a rel=\""
          + "([^\"]+)"      
          + "\"[^>]*>"      
          + "([^<]+)" + "</a>.*?</td>";

I am just beginning learing using Java Regular Expression,I would be grateful if you could correct me if I am in the wrong track or my regular expressiona are wrong. Thanks a lot,guys.

© Stack Overflow or respective owner

Related posts about java

Related posts about html