Algorithm to match natural text in mail

Posted by snøreven on Stack Overflow
Published on 2012-04-06T16:44:00Z Indexed on 2012/04/06 17:29 UTC

I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.

Example:

Hi tom,

last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

  • list item 2
  • list item 3
  • list item 3

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

regards, K.

---line-of-funny-characters-#######

example inc.

33 evil street, london

mobile: 00 234534/234345

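Ideally the algorithm would match only the two running-text paragraphs (shown in bold in the original post), and skip the greeting, the list, the sign-off and the signature.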

Is there a recommended approach, or are there even existing algorithms for this problem? Should I try approximate regular expressions, or something more statistical based on the number of punctuation marks, line length and so on?
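To make the statistical idea concrete, here is a minimal sketch of what such a heuristic could look like. It is only an illustration under stated assumptions, not a known or recommended algorithm: the function names (`looks_like_prose`, `extract_prose`), the features (word count, share of letters, bullet markers, sentence punctuation) and both thresholds are made up for the example mail above and would need tuning on real data.

```python
import re
from typing import List

# Hypothetical thresholds, picked by eye for the example mail; tune on real mail.
MIN_WORDS = 8           # greetings, sign-offs and signature lines are short
MIN_ALPHA_RATIO = 0.7   # prose is mostly letters and spaces

# A line that starts with a bullet character or a "1." / "1)" style number.
BULLET = re.compile(r"^\s*(?:[•*\-]|\d+[.)])\s+")
# At least one sentence-ending punctuation mark somewhere in the block.
SENTENCE_END = re.compile(r"[.!?]")


def looks_like_prose(block: str) -> bool:
    """Heuristic guess whether a blank-line-separated block is running text."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    if not lines:
        return False
    if any(BULLET.match(ln) for ln in lines):
        return False  # treat the whole block as a list
    text = " ".join(ln.strip() for ln in lines)
    if len(text.split()) < MIN_WORDS:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / len(text) < MIN_ALPHA_RATIO:
        return False  # too many digits or symbols, e.g. phone numbers, rulers
    return bool(SENTENCE_END.search(text))


def extract_prose(mail: str) -> List[str]:
    """Split the mail on blank lines and keep the blocks that look like prose."""
    blocks = re.split(r"\n\s*\n", mail)
    return [b.strip() for b in blocks if looks_like_prose(b)]
```

Calling `extract_prose(mail_text)` on the example above would keep the two Lorem ipsum paragraphs and drop the rest. Fixed rules like these are brittle (a long signature line or a one-sentence paragraph will be misclassified), so the same features could instead be fed to a trained classifier.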

© Stack Overflow or respective owner
