(Python) Extracting Text from Source Code?

Posted by zhuyxn on Stack Overflow
Published on 2012-06-08T04:38:42Z

I currently have a large webpage whose source code is ~200,000 lines, almost all (if not all) of it HTML. More specifically, the page's content is a few thousand blocks of paragraphs separated by line breaks (though a line break does not necessarily mean there is a separation in content).

My main objective is to extract the text from the source code as if I were copying and pasting the rendered webpage into a text editor. I would like to pass the result to another parsing function, which was originally written to take copied/pasted text rather than source code.

To do this, I'm currently fetching the page with urllib2 and calling .get_text() in Beautiful Soup. The problem is that Beautiful Soup leaves tremendous amounts of whitespace in the extracted text, which makes it difficult to pass the result into the second "text" parser. I have done quite a bit of research on parsing HTML, but I'm frankly not sure how to solve this problem easily. Furthermore, I'm a bit confused about how to use libraries like lxml to extract text as if I had simply copied and pasted it.
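
For illustration, here is a minimal sketch of one way to tidy the whitespace after extraction. It assumes Python 3 and that the string returned by .get_text() is already in hand. The helper name `collapse_whitespace` and the `sample` string are hypothetical and not part of Beautiful Soup. It only approximates what copying and pasting from a browser would give.

```python
import re

def collapse_whitespace(raw):
    """Tidy text such as the string returned by Beautiful Soup's get_text()."""
    lines = []
    for line in raw.splitlines():
        # Squeeze runs of spaces, tabs and non-breaking spaces into one space,
        # then drop leading/trailing whitespace on the line.
        line = re.sub(r"[ \t\u00a0]+", " ", line).strip()
        lines.append(line)
    text = "\n".join(lines)
    # Allow at most one blank line between blocks of text.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Hypothetical sample standing in for get_text() output.
sample = "\n\n   First   paragraph \n\n\n\n\t Second\u00a0 paragraph\n   \n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

With Beautiful Soup this would be called as `collapse_whitespace(soup.get_text())`. In bs4, `get_text()` also accepts a separator and a `strip` flag, e.g. `soup.get_text("\n", strip=True)`, which strips each text fragment before joining. Whether that alone is enough depends on the page.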
