Best way to get back to using the power of lxml after having to use a regex to find something in an

Posted by PyNEwbie on Stack Overflow
Published on 2010-03-10T23:13:03Z Indexed on 2010/03/17 23:51 UTC

I am trying to extract text from a large number of HTML documents (hundreds of thousands). The documents are really forms, but they are prepared by a very large group of different organizations, so there is significant variation in how each document is built. For example, the documents are divided into chapters, and I might want to extract the contents of Chapter 5 from every document so I can analyze it. Initially I thought this would be easy, but it turns out that the authors might use a set of non-nested tables throughout the document to hold the content, so that Chapter n is displayed using td tags inside a table. Or they might use other elements such as p tags, h tags, div tags or any other block-level element.

After trying repeatedly to use lxml to identify the beginning and end of each chapter, I have determined that it is a lot cleaner to use a regular expression, because in every case, no matter what the enclosing HTML element is, the chapter label always has the form

>Chapter #

It is a little more complicated in that there might be some whitespace or non-breaking spaces, represented in different ways (as character entities such as &#160;, or as plain spaces). Nonetheless, it was trivial to write a regular expression to identify the beginning of each section. (The beginning of one section is the end of the previous section.)
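A minimal sketch of such a heading pattern follows. The post does not show the actual expression, so the names `NBSP`, `CHAPTER_RE` and `find_chapter_starts`, as well as the list of non-breaking-space spellings, are assumptions made for illustration.

```python
import re

# Assumed spellings of "a space": real whitespace, the named and numeric
# entities, and a literal U+00A0 character. Extend as the corpus requires.
NBSP = r"(?:\s|&nbsp;|&#160;|&#xa0;|\xa0)"

# A '>' that closes some tag, optional spacing, the word "Chapter",
# at least one space of any kind, then the chapter number.
CHAPTER_RE = re.compile(r">" + NBSP + r"*Chapter" + NBSP + r"+(\d+)", re.IGNORECASE)


def find_chapter_starts(html):
    """Return (chapter_number, offset_of_the_leading_'>') pairs in document order."""
    return [(int(m.group(1)), m.start()) for m in CHAPTER_RE.finditer(html)]


sample = (
    '<div><font>Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>'
    '<p>Some text.</p>'
    '<td>Chapter&nbsp;2. Next Steps</td>'
)
for number, offset in find_chapter_starts(sample):
    print(number, offset)
```

Each offset points at the `>` just before the heading text, which is the position the later steps start from.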

But now I want to use lxml to get the text out. My thought is that I really have no choice but to walk along my string to find the closing tag of the element that encloses the text I matched to locate the relevant section.

That is, here is one example where the element holding the chapter name is a div:

<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman">Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>

So I imagine I would begin at the location where I found the match for Chapter 1 and set up a regular expression to find the next

</div|</td|</p|</h1 . . .

So at this point I have identified the type of element holding my chapter heading.
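The closing-tag search could be sketched as below. The tag list in `BLOCK_TAGS` and the helper name `enclosing_block` are hypothetical, and the backward search for the opening tag is a simplification that nested elements of the same name (a div inside a div) can fool.

```python
import re

# Assumed list of block-level tags; the post only names div, td, p and h1.
BLOCK_TAGS = ("div", "td", "p", "h1", "h2", "h3", "h4", "h5", "h6", "li")
CLOSE_RE = re.compile(r"</(" + "|".join(BLOCK_TAGS) + r")\s*>", re.IGNORECASE)


def enclosing_block(html, match_pos):
    """Guess the block element around match_pos.

    Returns (tag_name, open_start, close_end), or None if nothing is found.
    """
    # First block-level closing tag at or after the heading match.
    close = CLOSE_RE.search(html, match_pos)
    if close is None:
        return None
    tag = close.group(1).lower()
    # Nearest opening tag of the same name before the heading match.
    open_re = re.compile(r"<" + tag + r"\b", re.IGNORECASE)
    opens = [m.start() for m in open_re.finditer(html, 0, match_pos)]
    if not opens:
        return None
    return tag, opens[-1], close.end()


sample = (
    '<p>intro</p>'
    '<div style="x" align="left"><font style="y">'
    'Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>'
    '<p>Body</p>'
)
pos = sample.index(">Chapter")
found = enclosing_block(sample, pos)
print(found)
```

Inline closers such as `</font>` are skipped because they are not in the tag list, so the search lands on the surrounding `</div>`.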

I can use the same logic to find all of the text that is within that element, that is, set up a regular expression to help me mark from

>Chapter 1.&#160;&#160;&#160;Our Beginnings.<

So I have identified where my Chapter 1 begins.

I can do the same for Chapter 2 (which is where Chapter 1 ends).

Now I imagine I am going to snip the document, beginning at the opening of the element that indicates where Chapter 1 begins and ending just before the opening of the element that indicates where Chapter 2 begins. The resulting string will then be fed to lxml so I can use its power to get the content.
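A rough sketch of that hand-off, assuming the two offsets have already been found by the regular-expression steps. The helper name `chapter_text` and the sample document are made up for illustration; `lxml.html.fragment_fromstring` with `create_parent` is used so that a slice with several top-level elements still comes back under a single root.

```python
import lxml.html


def chapter_text(html, start, end):
    """Parse html[start:end] as a fragment and return its visible text."""
    snippet = html[start:end]
    # create_parent="div" wraps everything in one new <div>, so the slice
    # does not have to be a single well-formed element.
    root = lxml.html.fragment_fromstring(snippet, create_parent="div")
    # Join text nodes with spaces, then collapse runs of whitespace
    # (str.split() also treats non-breaking spaces as whitespace).
    # Caveat: this also puts a space where an inline tag splits a word.
    return " ".join(" ".join(root.itertext()).split())


sample = (
    '<div><font>Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>'
    '<p>We started small.</p>'
    '<div><font>Chapter 2. Growth</font></div>'
)
start = sample.index("<div")            # opening of the Chapter 1 heading element
end = sample.index("<div", start + 1)   # opening of the Chapter 2 heading element
print(chapter_text(sample, start, end))
```

Once the slice is an lxml tree, XPath or `cssselect` queries can be run on `root` instead of flattening it to text.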

I am going to all of this trouble because I have read over and over: never use a regular expression to extract content from HTML documents. Yet I have not hit on a way to be as accurate with lxml at identifying the starting and ending locations of the text I want to extract. For example, I can never be certain that the subtitle of Chapter 1 is Our Beginnings; it could be Our Red Canary. I spent two solid days trying with lxml to be confident that I had the beginning and ending elements, and I could only be accurate less than 60% of the time, whereas a very short regular expression has given me better than 95% success.

I have a tendency to make things more complicated than necessary, so I am wondering if anyone has seen or solved a similar problem and has an approach (not the details, mind you) that they would like to offer.

python

lxml

regex

html-parsing

beginner
