Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python
- by RizwanK
I'm downloading a long list of my email subject lines , with the intent of finding email lists that I was a member of years ago, and would want to purge them from my Gmail account (which is getting pretty slow.)
I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in the subject.
I'm aware that I could search/sort by the common occurrence of items from a particular email address (and I intend to), but I'd like to correlate that data with repeating subject lines....
Now, many subject lines would fail a string match, but
"Google Friends : Our latest news"
"Google Friends : What we're doing today"
are more similar to each other than a random subject line, as is:
"Virgin Airlines has a great sale today"
"Take a flight with Virgin Airlines"
So -- how can I start to automagically extract trends/examples of strings that may be more similar.
Approaches I've considered and discarded ('because there must be some better way'):
Extracting all the possible substrings and ordering them by how often they show up, and manually selecting relevant ones
Stripping off the first word or two and then count the occurrence of each sub string
Comparing Levenshtein distance between entries
Some sort of string similarity index ...
Most of these were rejected for massive inefficiency or likelyhood of a vast amount of manual intervention required. I guess I need some sort of fuzzy string matching..?
In the end, I can think of kludgy ways of doing this, but I'm looking for something more generic so I've added to my set of tools rather than special casing for this data set.
After this, I'd be matching the occurring of particular subject strings with 'From' addresses - I'm not sure if there's a good way of building a data structure that represents how likely/not two messages are part of the 'same email list' or by filtering all my email subjects/from addresses into pools of likely 'related' emails and not -- but that's a problem to solve after this one.
Any guidance would be appreciated.