Parsing HTML using HtmlParser
- by Blankman
My HTML contains about 20 repetitions of the pattern below. Each repetition represents one product and spans multiple rows of the HTML table. A single instance of the pattern is shown here.
<table>
..
<!-- product starts here, this html comment is not in the real html -->
<tr>
<td rowspan="5" class="product" valign="top"><nobr> ????????????</td>
</tr>
<tr>
<td class="title" ??????????>?????????</td>
<td class="title" ??????????>?????????</td>
<td class="title" ??????????>?????????</td>
<td class="title" ??????????>?????????</td>
<td class="title" ??????????>?????????</td>
<td class="title" ??????????>?????????</td>
</tr>
<tr>
<td class="data" ?????? </td>
<td class="data" ?????? </td>
<td class="data" ?????? </td>
<td class="data" ?????? </td>
<td class="data" ?????? </td>
<td class="data" ?????? </td>
</tr>
<tr>
<td colspan="5" ????????</td>
</tr>
<tr>
<td colspan="6" width="100%"> <hr></td>
</tr>
<!-- product ends here, this html comment is not in the real html -->
<!-- above pattern repeats multiple times in the HTML -->
..
</table>
I am trying to use HtmlParser for this.
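import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;

Parser rowParser = new Parser();
rowParser.setInputHTML(page.getHtml()); // page object represents an HTML page
rowParser.setEncoding("UTF-8");
// Match every <tr> that has a direct child <td class="product">
NodeFilter productRowFilter = new AndFilter(
    new TagNameFilter("tr"),
    new HasChildFilter(
        new AndFilter(
            new TagNameFilter("td"),
            new HasAttributeFilter("class", "product"))));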
The filter above doesn't work; I'm just showing what I have so far.
I need to somehow combine these filters and use the last td to mark the end of the pattern, i.e. the td with colspan="6" and width="100%" that has an hr child element.
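To make the intended grouping concrete, here is a minimal sketch of the start/end-marker logic, written in plain Java with no HtmlParser calls. It assumes the rows have already been extracted in document order and reduced to a hypothetical Row type that records only two facts: whether the row contains the td with class="product", and whether it is the separator row containing the hr. The names ProductRowGrouper, Row, and group are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

public class ProductRowGrouper {

    /** Hypothetical, simplified view of one <tr>; not an HtmlParser type. */
    public static final class Row {
        public final boolean hasProductCell; // row contains <td class="product">
        public final boolean isSeparator;    // row contains the <td colspan="6"> with <hr>
        public final String text;            // placeholder for the row's content

        public Row(boolean hasProductCell, boolean isSeparator, String text) {
            this.hasProductCell = hasProductCell;
            this.isSeparator = isSeparator;
            this.text = text;
        }
    }

    /**
     * Groups rows into products. A product starts at a row with a product
     * cell and ends at the next separator row (inclusive). Rows outside any
     * product are skipped, and a product that never reaches a separator is
     * dropped.
     */
    public static List<List<Row>> group(List<Row> rows) {
        List<List<Row>> products = new ArrayList<>();
        List<Row> current = null; // null while we are outside a product

        for (Row row : rows) {
            if (row.hasProductCell) {
                current = new ArrayList<>(); // start marker: begin a new product
            }
            if (current == null) {
                continue; // row does not belong to any product
            }
            current.add(row);
            if (row.isSeparator) {
                products.add(current); // end marker: product is complete
                current = null;
            }
        }
        return products;
    }
}
```

With HtmlParser, the Row objects would presumably be built from the tr nodes returned by a plain TagNameFilter("tr"), which is the part I have not worked out.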
I have been struggling with this and had resorted to regular expressions, but I was told numerous times NOT to use regex for HTML parsing, so here I am!
Your help is much appreciated!