Text extraction with Java HTML parsers

Posted by zenmonkey on Stack Overflow
Published on 2010-04-09T18:37:38Z

I want to use an HTML parser that does the following in a nice, elegant way:

  1. Extract text (this is the most important)
  2. Extract links and meta keywords
  3. Reconstruct the original document (optional, but a nice feature to have)
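To make requirements 1 and 2 concrete, here is a minimal sketch that uses only the Swing HTML parser bundled with the JDK (`javax.swing.text.html`), not Jericho. The class name `HtmlExtractSketch` and its fields are hypothetical names chosen for illustration. It does not attempt requirement 3, and joining text chunks with a space is a simplification that can split a word wrapped in inline tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of Jericho or any other library.
final class HtmlExtractSketch extends HTMLEditorKit.ParserCallback {
    private final StringBuilder rawText = new StringBuilder();
    final List<String> links = new ArrayList<>();
    String keywords; // stays null if no <meta name="keywords"> is found

    // Called by the parser for each run of character data.
    @Override
    public void handleText(char[] data, int pos) {
        rawText.append(' ').append(data);
    }

    // Non-empty elements such as <a> arrive here; collect href values.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Empty elements such as <meta> arrive here; pick out the keywords entry.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.META) {
            Object name = attrs.getAttribute(HTML.Attribute.NAME);
            Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
            if (name != null && content != null
                    && "keywords".equalsIgnoreCase(name.toString())) {
                keywords = content.toString();
            }
        }
    }

    // Extracted text with runs of whitespace collapsed to single spaces.
    String text() {
        return rawText.toString().replaceAll("\\s+", " ").trim();
    }

    // Parses an HTML string and returns the populated callback.
    static HtmlExtractSketch parse(String html) {
        HtmlExtractSketch result = new HtmlExtractSketch();
        try {
            // true = ignore any charset declared in the document
            new ParserDelegator().parse(new StringReader(html), result, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Calling `HtmlExtractSketch.parse(html)` and then reading `text()`, `links`, and `keywords` covers the first two items. A dedicated library would be expected to handle malformed markup and document reconstruction better than this sketch.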

From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?
