Algorithm to match natural text in mail
- by snøreven
I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.
Example:
Hi tom,
last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et
dolore magna aliqua.
list item 2
list item 3
list item 3
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit
in voluptate velit
regards, K.
---line-of-funny-characters-#######
example inc.
33 evil street, london
mobile: 00 234534/234345
Ideally the algorithm would match only the bold parts, that is, the two coherent paragraphs beginning "last monday we did" and "Ut enim ad minim veniam".
Is there a recommended approach, or are there even existing algorithms for this problem? Should I try approximate regular expressions, or something more statistical based on the number of punctuation marks, line length and so on?
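To make the statistical idea concrete, here is a minimal sketch in Python. It is not an established algorithm. It keeps a line if it has enough words and consists mostly of letters, and it also keeps a short line that directly follows a long prose line, on the assumption that it is a wrapped continuation. The function names (is_prose_line, extract_prose) and all thresholds (min_words, min_letter_ratio, wrap_len) are hypothetical and would need tuning on real mail.

```python
def is_prose_line(line, min_words=8, min_letter_ratio=0.85):
    """Guess whether a single line is running text (thresholds are assumptions)."""
    words = line.split()
    if len(words) < min_words:
        return False
    chars = "".join(words)  # all non-whitespace characters
    letters = sum(ch.isalpha() for ch in chars)
    return letters / len(chars) >= min_letter_ratio


def extract_prose(text, wrap_len=60):
    """Return the lines of `text` that look like coherent prose."""
    kept = []
    prev_wrapped = False  # previous line was a long prose line, so it may wrap
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            prev_wrapped = False
            continue
        prose = is_prose_line(line)
        if prose or prev_wrapped:
            kept.append(line)
        # only a long prose line can hand its status to the next line
        prev_wrapped = prose and len(line) >= wrap_len
    return kept


MAIL = """\
Hi tom,
last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et
dolore magna aliqua.
list item 2
list item 3
list item 3
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit
in voluptate velit
regards, K.
---line-of-funny-characters-#######
example inc.
33 evil street, london
mobile: 00 234534/234345
"""

for line in extract_prose(MAIL):
    print(line)
```

On the example mail above, this keeps the two paragraphs (four physical lines) and drops the greeting, the list items and the signature. It is easy to fool: a long list item, or a short one-line sentence, would be misclassified, so it is at best a baseline to compare other approaches against.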