How to parse malformed HTML in Python, using standard libraries
Posted by bukzor on Stack Overflow
Published on 2010-04-20T16:29:21Z
There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.
I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.
Requirements:
- Use only Python standard library components (I'm currently using v2.6)
- DOM support
- Handle HTML entities
- Handle partial documents (like: Hello, <i>World</i>!)
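To make these requirements concrete, here is a minimal sketch of one way they might be approached with only the standard library. It is an illustration, not a confirmed answer: it assumes that an ElementTree counts as "DOM support", it is written for Python 3's html.parser (on 2.6 the module is named HTMLParser and entities arrive through handle_entityref/handle_charref instead), and the names FragmentTreeBuilder, parse_fragment, VOID_TAGS and the "fragment" wrapper element are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a short, incomplete list of tags that never take an end tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FragmentTreeBuilder(HTMLParser):
    """Hypothetical helper: turns parser events into an ElementTree."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text.
        super().__init__(convert_charrefs=True)
        # Synthetic wrapper so a partial document still has a single root.
        self.root = ET.Element("fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_fragment(html):
    builder = FragmentTreeBuilder()
    builder.feed(html)
    builder.close()  # flush any text still buffered by the parser
    return builder.root


root = parse_fragment("Hello, <i>World</i>!")
print(ET.tostring(root, encoding="unicode"))
```

The wrapper element is a design choice of this sketch: ElementTree needs exactly one root, and a fragment like the one above starts with bare text.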
Bonus points:
- XPATH support
- Handle unclosed/malformed tags (like: <big>does anyone here know <html ???)
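As a rough way to check the malformed-tag case, the sketch below feeds that sample to the standard library's event-based parser and records what comes out. It assumes Python 3's html.parser, which tolerates broken markup; the parser shipped with older versions such as 2.6 was stricter and could raise HTMLParseError on input like this. EventRecorder and record_events are hypothetical names, and how the dangling "<html ???" is reported at end of input may differ between Python versions, so only the first two events are relied on here.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records parser events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(html):
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()  # tell the parser there is no more input
    return recorder.events


events = record_events("<big>does anyone here know <html ???")
for event in events:
    print(event)
```

Note that no end event for big is ever produced, so any tree built on top of these events has to decide for itself when an unclosed element ends. For the XPath bonus, xml.etree.ElementTree's find and findall accept only a limited subset of XPath, which may or may not be enough.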
© Stack Overflow or respective owner