Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python

Posted by RizwanK on Stack Overflow See other posts from Stack Overflow or by RizwanK
Published on 2010-05-02T00:27:16Z Indexed on 2010/05/02 0:37 UTC
Read the original article Hit count: 308

Filed under:

I'm downloading a long list of my email subject lines , with the intent of finding email lists that I was a member of years ago, and would want to purge them from my Gmail account (which is getting pretty slow.)

I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in the subject.

I'm aware that I could search/sort by the common occurrence of items from a particular email address (and I intend to), but I'd like to correlate that data with repeating subject lines....

Now, many subject lines would fail a string match, but "Google Friends : Our latest news" "Google Friends : What we're doing today" are more similar to each other than a random subject line, as is: "Virgin Airlines has a great sale today" "Take a flight with Virgin Airlines"

So -- how can I start to automagically extract trends/examples of strings that may be more similar.

Approaches I've considered and discarded ('because there must be some better way'):

Extracting all the possible substrings and ordering them by how often they show up, and manually selecting relevant ones
Stripping off the first word or two and then count the occurrence of each sub string
Comparing Levenshtein distance between entries
Some sort of string similarity index ...

Most of these were rejected for massive inefficiency or likelyhood of a vast amount of manual intervention required. I guess I need some sort of fuzzy string matching..?

In the end, I can think of kludgy ways of doing this, but I'm looking for something more generic so I've added to my set of tools rather than special casing for this data set.

After this, I'd be matching the occurring of particular subject strings with 'From' addresses - I'm not sure if there's a good way of building a data structure that represents how likely/not two messages are part of the 'same email list' or by filtering all my email subjects/from addresses into pools of likely 'related' emails and not -- but that's a problem to solve after this one.

Any guidance would be appreciated.

Developer IT

Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python - Developer IT

Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python

data-mining

python

string

email

fuzzy

Related posts about data-mining

SQLAuthority News – Links to Book On Line – Data Mining Algorithms (Analysis Services – Data Mining)

Data Mining Resources

Integrating Data Mining into your BI Solution (Presentation)

What data mining tools do you use?

NEW 2-Day Instructor Led Course on Oracle Data Mining Now Available!

Related posts about python

unmet dependencies in Ubuntu 12.04

How can I get sikuli-ide to work?

Getting PATH right for python after MacPorts install

call python with system() in R to run a python script emulating the python console

Python - Calling a non python program from python?

Categories cloud