Algorithm to compute how nearly identical two texts are

Posted by qster on Stack Overflow
Published on 2010-04-03T05:33:34Z Indexed on 2010/04/03 5:43 UTC

What algorithm would you suggest for computing a score from 0 to 1 (a float) that says how close to identical two texts are?

Note that I don't mean similar in meaning (i.e., saying the same thing in a different way). I mean the exact same words, where one of the two texts may have extra words, slightly different words, extra newlines, and so on.
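To make that requirement concrete, here is a minimal sketch of one possible scoring function, assuming Python and its standard `difflib` module. The name `word_similarity` and the choice to lowercase and strip punctuation are my own assumptions, not a fixed part of the problem.

```python
import difflib
import re


def word_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the same sequence of words."""
    # Lowercase and keep only word characters, so extra newlines,
    # spacing, and punctuation do not change the score (an assumption).
    words_a = re.findall(r"\w+", a.lower())
    words_b = re.findall(r"\w+", b.lower())
    # ratio() is 2 * M / T, where M is the number of matched words and
    # T is the total number of words in both texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


print(word_similarity("Hello world", "hello\n\nworld!"))
print(word_similarity("a b c d", "a b c d e f"))
```

Because the comparison is done word by word, a slightly misspelled word counts as a completely different word; comparing characters instead of words would be more forgiving but slower on long texts.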

A good example of what I want is whatever Google uses to identify duplicate content on websites ("X search results very similar to the ones shown have been omitted, click here to see them").
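I don't know what Google actually runs, but one commonly described technique for near-duplicate detection is to compare sets of overlapping word n-grams ("shingles") with the Jaccard index. The following is only a sketch under that assumption; the names `shingles` and `jaccard_similarity` and the shingle size `k = 3` are illustrative choices.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Return |A & B| / |A | B| over the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


print(jaccard_similarity("a b c d", "a b c d e"))
```

Unlike a sequence alignment, this score ignores where in the text a shingle occurs, so it tolerates reordered paragraphs but is sensitive to the choice of `k`.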

The reason I need it is that my website lets users post comments. Similar but different pages currently have their own comments, so many users ended up copying and pasting the same comment onto all the similar pages. Now I want to merge them (all similar pages will "share" their comments, so a comment posted on page A will also appear on similar page B), and I would like to programmatically erase all those copy-pasted comments from the same user.
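The cleanup step might look roughly like the sketch below, assuming the comments have already been read from the database as `(comment_id, user_id, text)` tuples. That tuple layout, the function names, and the 0.9 threshold are all hypothetical, and the MySQL access itself is left out.

```python
import difflib
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and split into words, dropping punctuation and newlines."""
    return re.findall(r"\w+", text.lower())


def similarity(words_a, words_b):
    """Word-level similarity in [0, 1] using difflib's ratio."""
    return difflib.SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def find_duplicates(comments, threshold=0.9):
    """Return ids of comments that nearly repeat an earlier comment by the same user.

    comments: iterable of (comment_id, user_id, text); hypothetical layout.
    threshold: assumed cut-off; tune it on real data.
    """
    kept_by_user = defaultdict(list)  # user_id -> word lists of kept comments
    duplicates = []
    for comment_id, user_id, text in comments:
        words = normalize(text)
        if any(similarity(words, kept) >= threshold for kept in kept_by_user[user_id]):
            duplicates.append(comment_id)
        else:
            kept_by_user[user_id].append(words)
    return duplicates


sample = [
    (1, "u1", "Great page, thanks a lot!"),
    (2, "u1", "great page\nthanks a lot"),
    (3, "u2", "Great page, thanks a lot!"),
    (4, "u1", "Totally different remark"),
]
print(find_duplicates(sample))
```

Comments are only compared against other comments from the same user, which keeps the number of pairwise comparisons small even with millions of rows, as long as no single user has an enormous number of comments.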

I have a few million comments, but speed shouldn't be an issue since this is a one-time job that will run in the background.

The programming language doesn't really matter (as long as it can interface with a MySQL database), but I was thinking of doing it in C++.

© Stack Overflow or respective owner
