How to extract terms from an HTML document

Posted by bookcasey on Super User
Published on 2012-06-21T14:28:18Z Indexed on 2012/06/21 15:18 UTC

I have an HTML document filled with terms that I need to put into a spreadsheet.

They follow this basic pattern:

<ul>
     <li class="name"><a href="spot.html">Spot</a></li>
     <li class="type">Dog</li>
     <li class="color">Red</li>
</ul>
<ul>
     <li class="name"><a href="mittens.html">Mittens</a></li>
     <li class="type">Cat</li>
     <li class="color">Brown</li>
</ul>
<ul>
     <li class="name"><a href="squakers.html">Squakers</a></li>
     <li class="type">Little Parrot</li>
     <li class="color">Rainbow</li>
</ul>

It's very consistent.

I need to extract the text inside each li.name a (for example, "Spot"), but only if the type is "Dog" or "Parrot", and put the results in a spreadsheet.

I've been trying to use Sublime Text's ability to Find with regex, but I'm really struggling, and since regex and HTML usually don't play nice, I was wondering if there is a better and easier way to accomplish this. Thanks.
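
One possible non-regex approach, sketched here with Python's standard-library html.parser and csv modules. This is only a rough illustration under some assumptions: that every record is a single, non-nested <ul> shaped like the ones above, and that "Parrot" should also match types such as "Little Parrot" (a case-insensitive substring match). The names PetParser, extract_names, to_csv and WANTED are made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Assumption: a case-insensitive substring match, so "Little Parrot" counts.
WANTED = ("dog", "parrot")


class PetParser(HTMLParser):
    """Collects one {li class: text} record per <ul> block."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.current = None  # record for the <ul> currently being read
        self.field = None    # class of the <li> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.current = {}
        elif tag == "li" and self.current is not None:
            self.field = dict(attrs).get("class")
            if self.field:
                self.current[self.field] = ""

    def handle_endtag(self, tag):
        if tag == "li":
            self.field = None
        elif tag == "ul" and self.current is not None:
            self.records.append(self.current)
            self.current = None

    def handle_data(self, data):
        # Text inside nested tags (such as the <a>) still lands in the open <li>.
        if self.current is not None and self.field:
            self.current[self.field] += data.strip()


def extract_names(html, wanted=WANTED):
    """Return the names whose type contains one of the wanted words."""
    parser = PetParser()
    parser.feed(html)
    parser.close()
    return [
        record["name"]
        for record in parser.records
        if record.get("name")
        and any(word in record.get("type", "").lower() for word in wanted)
    ]


def to_csv(names):
    """Render the names as one-column CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name"])
    writer.writerows([name] for name in names)
    return buffer.getvalue()
```

Calling extract_names() on the three sample blocks above should give "Spot" and "Squakers" and skip "Mittens". The text returned by to_csv() can be saved to a .csv file and opened in a spreadsheet application.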

© Super User or respective owner
