How to parse malformed HTML in Python, using standard libraries

Posted by bukzor on Stack Overflow
Published on 2010-04-20T16:29:21Z

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

  • Use only Python standard library components (I'm currently using v2.6)
  • DOM support
  • Handle HTML entities (&nbsp;)
  • Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

  • XPATH support
  • Handle unclosed/malformed tags (<big>does anyone here know <html ???)
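
To make the requirements concrete, here is a minimal sketch of the kind of helper they describe, built only from standard library pieces. It is an illustration under stated assumptions, not a known solution: it uses Python 3 module names (in Python 2.6 the parser lives in the HTMLParser module), the names LenientTreeBuilder, parse_fragment, and VOID_TAGS are hypothetical, and the synthetic "fragment" root element is an arbitrary choice for holding partial documents. It produces an ElementTree rather than a W3C DOM, and ElementTree understands only a limited subset of XPath.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never take a closing tag (an incomplete, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LenientTreeBuilder(HTMLParser):
    """Hypothetical helper: builds an ElementTree from tag-soup HTML."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into text.
        super().__init__(convert_charrefs=True)
        # Synthetic wrapper so that partial documents have a single root.
        self.root = ET.Element("fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag, implicitly closing
        # anything left open inside it; stray end tags are ignored.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text following a closed child belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_fragment(markup):
    """Parse a complete or partial HTML string into an Element tree."""
    builder = LenientTreeBuilder()
    builder.feed(markup)
    builder.close()  # flush any buffered trailing text
    return builder.root


if __name__ == "__main__":
    tree = parse_fragment("Hello, <i>World</i>!")
    print(ET.tostring(tree, encoding="unicode"))
    # Limited XPath-style lookup provided by ElementTree.
    print([element.text for element in tree.findall(".//i")])
```

A sketch like this only closes tags by name matching, so it does not reproduce the error-recovery rules that browsers apply to badly nested markup.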

© Stack Overflow or respective owner