How to parse malformed HTML in python, using standard libraries
- by bukzor
There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.
I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.
Requirements:
Use only Python standard library components (I'm currently using v2.6)
DOM support
Handle HTML entities (&nbsp;)
Handle partial documents (like: Hello, <i>World</i>!)
Bonus points:
XPATH support
Handle unclosed/malformed tags (like: <big>does anyone here know <html ???)
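
To make the problem concrete, here is a minimal sketch of how a strict standard-library DOM parser reacts to inputs along the lines of those listed above. It assumes xml.dom.minidom as the DOM candidate. The helper name strict_parse_ok and the <p> wrapper around the entity sample are made up for illustration, not part of any library.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Sample inputs modelled on the requirements above (illustrative only).
SAMPLES = [
    "<p>a&nbsp;b</p>",                       # HTML entity that plain XML does not define
    "Hello, <i>World</i>!",                  # partial document with no single root element
    "<big>does anyone here know <html ???",  # unclosed and malformed tags
]


def strict_parse_ok(markup):
    """Return True if minidom can build a DOM from markup, False otherwise."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        # expat rejects anything that is not well-formed XML
        return False


# One boolean per sample, in the same order as SAMPLES.
results = [strict_parse_ok(sample) for sample in SAMPLES]
```

Each sample is rejected with an ExpatError, while a well-formed fragment such as "<p>ok</p>" parses fine, which is why a strict XML parser alone does not meet the requirements.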