Lexers / parsers for (un)structured text documents
Posted by wilson32 on Stack Overflow
Published on 2010-01-18T16:57:00Z
Indexed on 2010/05/17 6:20 UTC
There are lots of parsers and lexers for scripts (i.e. structured computer languages), but I'm looking for one that can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.
It's relatively easy for a person to identify these sections: where the table of contents, the acknowledgements, or the main body starts. It is also possible to build rule-based systems to identify some of them (such as paragraphs).
I don't expect it to be perfect, but does anyone know of such a broad 'block-based' lexer / parser? Or could you point me to literature that may help?
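To make the rule-based idea mentioned above concrete, here is a minimal sketch in Python of the kind of pass I have in mind. It is only an illustration under assumptions of my own: blocks are separated by blank lines, and a short single-line block that matches one of a few hand-written patterns counts as a heading. The names `HEADING_RE` and `split_blocks` and the patterns themselves are hypothetical, not taken from any existing library.

```python
import re
from typing import List, Tuple

# Hypothetical heading rules (an assumption for illustration);
# real documents will need their own patterns.
HEADING_RE = re.compile(
    r"^(chapter\s+\w+|table of contents|contents|acknowledge?ments)\b",
    re.IGNORECASE,
)


def split_blocks(text: str) -> List[Tuple[str, str]]:
    """Split text on blank lines and label each block 'heading' or 'paragraph'."""
    blocks = []
    for raw in re.split(r"\n\s*\n", text.strip()):
        # Re-join hard-wrapped lines so each block becomes one string.
        block = " ".join(line.strip() for line in raw.splitlines())
        if not block:
            continue
        # Only a short, single-line block that matches a rule is a heading.
        is_heading = (
            "\n" not in raw.strip()
            and len(block) < 60
            and bool(HEADING_RE.match(block))
        )
        blocks.append(("heading" if is_heading else "paragraph", block))
    return blocks
```

Calling `split_blocks(document_text)` returns a flat list of `(label, text)` pairs in document order. Grouping paragraphs under the preceding heading would be a natural next step, and this is exactly where such simple rules start to break down: a short paragraph that happens to begin with "Contents" would be mislabeled.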
© Stack Overflow or respective owner