Extract all text from an HTML page without losing context

Posted by grmbl on Stack Overflow
Published on 2010-05-07T03:03:33Z Indexed on 2010/05/07 3:08 UTC

For a translation program, I am trying to extract text from an HTML file with about 95% accuracy, so that the sentences and links can be translated.

For example:

<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>

This should give me two results to translate:

Overflow

Texts <b>go</b> here

Are there any suggestions, or commercial packages, for this problem?
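
To make the expected behaviour concrete, here is a minimal sketch in Python using only the standard library's html.parser. It rests on an assumption that is not stated in the question: a small set of inline formatting tags (INLINE_TAGS, chosen arbitrarily here) stays inside a segment, and every other tag, including <a> and <span>, ends the current segment. The names SegmentExtractor and extract_segments are hypothetical, and this is one possible approach rather than a complete solution.

```python
from html.parser import HTMLParser

# Assumption: these tags are formatting inside a sentence and are kept in place.
INLINE_TAGS = {"b", "i", "em", "strong", "u"}
# Assumption: the content of these tags is never translatable text.
SKIP_TAGS = {"script", "style"}


class SegmentExtractor(HTMLParser):
    """Collects translatable segments, keeping inline formatting tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buffer = []
        self._has_text = False
        self._skip_depth = 0

    def _flush(self):
        # Emit the buffered segment only if it holds real text, not just tags.
        segment = "".join(self._buffer).strip()
        if segment and self._has_text:
            self.segments.append(segment)
        self._buffer = []
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            # Keep the tag exactly as written in the source.
            self._buffer.append(self.get_starttag_text())
            return
        self._flush()  # any other tag is a segment boundary
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            self._buffer.append(f"</{tag}>")
            return
        self._flush()
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if data.strip():
            self._has_text = True
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


sample = '<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>'
for segment in extract_segments(sample):
    print(segment)
```

For the sample above this prints the two segments from the question. Note that character references such as &amp; are decoded in the output, so segments would need re-escaping before being written back into HTML, and translatable attributes such as alt or title are not handled by this sketch.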

© Stack Overflow or respective owner
