Algorithm to match natural text in mail
Posted by snøreven on Stack Overflow
Published on 2012-04-06T16:44:00Z
Indexed on 2012/04/06 17:29 UTC
I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.
Example:
Hi tom,
last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.
- list item 2
- list item 3
- list item 3
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit
regards, K.
---line-of-funny-characters-#######
example inc.
33 evil street, london
mobile: 00 234534/234345
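Ideally the algorithm would match only the bold parts, i.e. the two paragraphs of running text, and skip the greeting, the list items, the sign-off and the signature block.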
Is there a recommended approach, or are there even existing algorithms for this problem? Should I try approximate regular expressions, or something more statistical based on the number of punctuation marks, line length and so on?
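To make the "statistical" idea concrete, here is a rough sketch of a per-line heuristic in Python. It is not an established algorithm, just an illustration of the kind of features mentioned above (word count, share of letters, punctuation, list and separator markers). The function names `looks_like_prose` and `extract_prose` and the thresholds `min_words=8` and `min_alpha_ratio=0.7` are assumptions made up for this example and would need tuning on real mail.

```python
import re

# Hypothetical patterns, chosen for this sketch only.
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")   # "- item", "* item", "1. item"
SEPARATOR = re.compile(r"^\W*[-_=#*~]{3,}")             # "-----", "#####", "---text---"
WORD = re.compile(r"[^\W\d_]+")                         # runs of letters (Unicode-aware)


def looks_like_prose(line, min_words=8, min_alpha_ratio=0.7):
    """Guess whether a single line is running text.

    The thresholds are assumptions, not tuned values.
    """
    text = line.strip()
    if not text or LIST_MARKER.match(text) or SEPARATOR.match(text):
        return False
    # Greetings, sign-offs and address lines tend to be very short.
    if len(WORD.findall(text)) < min_words:
        return False
    # Phone numbers and decorative lines contain few letters.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / len(text) < min_alpha_ratio:
        return False
    # Require at least one sentence-level punctuation mark.
    return bool(re.search(r"[.!?,;:]", text))


def extract_prose(email_body):
    """Return the lines of an email body that look like running text."""
    return [ln.strip() for ln in email_body.splitlines() if looks_like_prose(ln)]
```

Calling `extract_prose` on the example mail above would keep the two long paragraphs and drop the greeting, the list items, the sign-off and the signature lines. A rule-based filter like this is brittle (a short one-sentence reply would be dropped too), so it is probably best treated as a baseline or as a feature extractor for a trained classifier.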
© Stack Overflow or respective owner