How to extract terms from an HTML document
- by bookcasey
I have an HTML document filled with terms that I need to put into a spreadsheet.
They follow this basic pattern:
<ul>
<li class="name"><a href="spot.html">Spot</a></li>
<li class="type">Dog</li>
<li class="color">Red</li>
</ul>
<ul>
<li class="name"><a href="mittens.html">Mittens</a></li>
<li class="type">Cat</li>
<li class="color">Brown</li>
</ul>
<ul>
<li class="name"><a href="squakers.html">Squakers</a></li>
<li class="type">Little Parrot</li>
<li class="color">Rainbow</li>
</ul>
It's very consistent.
I need to extract the text inside the li.name a element (so, "Spot"), but only when the type is "Dog" or "Parrot", and put those names into a spreadsheet.
I've been trying to use Sublime Text's regex Find, but I'm really struggling. Since regex and HTML usually don't play nicely together, I was wondering whether there is a better and easier way to accomplish this. Thanks.
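
To make the goal concrete, here is a minimal Python sketch of the kind of thing I have in mind, using only the standard library's html.parser and csv modules instead of regex. The names PetListParser, wanted_names, to_csv, WANTED and SAMPLE_HTML are made up for illustration. It assumes every record is one ul whose li elements carry the classes shown above, and that a type counts as a match when it contains the word "Dog" or "Parrot" (so "Little Parrot" matches); that matching rule is my own assumption.

```python
import csv
import io
from html.parser import HTMLParser

# Sample input copied from the pattern in the question.
SAMPLE_HTML = """
<ul>
<li class="name"><a href="spot.html">Spot</a></li>
<li class="type">Dog</li>
<li class="color">Red</li>
</ul>
<ul>
<li class="name"><a href="mittens.html">Mittens</a></li>
<li class="type">Cat</li>
<li class="color">Brown</li>
</ul>
<ul>
<li class="name"><a href="squakers.html">Squakers</a></li>
<li class="type">Little Parrot</li>
<li class="color">Rainbow</li>
</ul>
"""

# Assumption: a type matches if any of its words is in this set.
WANTED = {"dog", "parrot"}


class PetListParser(HTMLParser):
    """Collects one dict per <ul>, keyed by the class of each <li>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished records
        self._row = None    # record for the <ul> currently open
        self._field = None  # class of the <li> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self._row = {}
        elif tag == "li" and self._row is not None:
            self._field = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "li":
            self._field = None
        elif tag == "ul" and self._row is not None:
            self.rows.append({k: v.strip() for k, v in self._row.items()})
            self._row = None

    def handle_data(self, data):
        # Text inside nested tags (the <a>) is still inside the open <li>.
        if self._row is not None and self._field:
            self._row[self._field] = self._row.get(self._field, "") + data


def wanted_names(html):
    """Return the names whose type contains one of the WANTED words."""
    parser = PetListParser()
    parser.feed(html)
    parser.close()
    names = []
    for row in parser.rows:
        type_words = set(row.get("type", "").lower().split())
        if type_words & WANTED:
            names.append(row.get("name", ""))
    return names


def to_csv(names):
    """Render the names as CSV text with a single 'name' column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name"])
    writer.writerows([n] for n in names)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(wanted_names(SAMPLE_HTML)), end="")
```

Run as a script, this prints CSV text that could be redirected into a .csv file and opened in a spreadsheet. Matching on whole words rather than the exact type string is a deliberate choice so that "Little Parrot" is included; an exact comparison would be needed if only the literal types "Dog" and "Parrot" should count.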