Lexers / parsers for (un)structured text documents

Posted by wilson32 on Stack Overflow
Published on 2010-01-18T16:57:00Z Indexed on 2010/05/17 6:20 UTC

There are lots of parsers and lexers for scripts (i.e. structured computer languages). But I'm looking for one that can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.

It's relatively easy for a person to identify these sections: where the table of contents, the acknowledgements, or the main body starts. It is also possible to build rule-based systems to identify some of them (such as paragraphs).

I don't expect it to be perfect, but does anyone know of such a broad "block-based" lexer / parser? Or could you point me in the direction of literature that may help?
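To make the rule-based idea concrete, here is a minimal sketch in Python of the kind of block splitter described above. It is only an illustration under stated assumptions: blocks are assumed to be separated by blank lines, and the heading pattern (`HEADING_RE`), the 60-character length limit, and the function name `split_blocks` are all hypothetical choices, not taken from any existing library.

```python
import re

# Hypothetical rule: a heading is a short block that starts with
# "Chapter <something>" or a common front-matter title.
HEADING_RE = re.compile(
    r"^(chapter\s+\w+|table of contents|contents|acknowledge?ments)\b",
    re.IGNORECASE,
)

MAX_HEADING_LEN = 60  # assumed cut-off; tune for your documents


def split_blocks(text):
    """Split text into (kind, content) blocks, using blank lines as separators."""
    blocks = []
    for raw in re.split(r"\n\s*\n", text.strip()):
        # Join hard-wrapped lines of one block into a single line.
        content = " ".join(line.strip() for line in raw.splitlines())
        if not content:
            continue
        is_heading = len(content) < MAX_HEADING_LEN and HEADING_RE.match(content)
        blocks.append(("heading" if is_heading else "paragraph", content))
    return blocks


sample = (
    "Table of Contents\n\n"
    "Chapter 1\n\n"
    "It was a dark night.\nRain fell.\n\n"
    "Chapter 2\n\n"
    "The end."
)

for kind, content in split_blocks(sample):
    print(kind, "|", content)
```

A splitter like this only covers the easy cases (paragraphs and explicitly labelled headings). Deciding where the main body starts, or recognising unlabelled sections, would need additional rules or a statistical approach.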

© Stack Overflow or respective owner
