Algorithm to find the percentage of how much two texts are identical

Posted by qster on Stack Overflow See other posts from Stack Overflow or by qster
Published on 2010-04-03T05:33:34Z Indexed on 2010/04/03 5:43 UTC
Read the original article Hit count: 444

Filed under:

algorithm

|

language-agnostic

What algorithm would you suggest to identify how much from 0 to 1 (float) two texts are identical?

Note that I don't mean similar (ie, they say the same thing but in a different way), I mean exact same words, but one of the two texts could have extra words or words slightly different or extra new lines and stuff like that.

A good example of the algorithm I want is the one google uses to identify duplicate content in websites (X search results very similar to the ones shown have been omitted, click here to see them).

The reason I need it is because my website has the ability for users to post comments; similar but different pages currently have their own comments, so many users ended up copy&pasting their comments on all the similar pages. Now I want to merge them (all similar pages will "share" the comments, and if you post it on page A it will appear on similar page B), and I would like to programatically erase all those copy&pasted comments from the same user.

I have quite a few million comments but speed shouldn't be an issue since this is a one time thing that will run in the background.

The programming language doesn't really matter (as long as it can interface to a MySQL database), but I was thinking of doing it in C++.

© Stack Overflow or respective owner

Related posts about algorithm

Jpeg Algorithm vs BMP Algorithm?

as seen on Super User - Search for 'Super User'
I'm just wonder, what the differences are between creating a BMP file algorithm and JPG file algorithm ? If you know the others images' format algorithm, please post them. Thanks. >>> More
word disambiguation algorithm (Lesk algorithm)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hii.. Can anybody help me to find an algorithm in Java code to find synonyms of a search word based on the context and I want to implement the algorithm with WordNet database. For example, "I am running a Java program". From the context, I want to find the synonyms for the word "running", but the… >>> More
Search algorithm (with a sort algorithm already implemented)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello, Im doing a Java application and Im facing some doubts in which concerns performance. I have a PriorityQueue which guarantees me the element removed is the one with greater priority. That PriorityQueue has instances of class Event (which implements Comparable interface). Each Event is associated… >>> More
Is there any algorithm for finding LINES by PIXEL COLORS on picture?

as seen on Stack Overflow - Search for 'Stack Overflow'
So I have Image like this I want to get something like this (I hevent drawn all lines I want but I hope you can get my idea) I need algorithm for finding all straight lines on it by just reading colors of pixels. No hard math, no Haar, no Hough. Some algorithm which would be based on points… >>> More
collsion issues with quadtree [on hold]

as seen on Game Development - Search for 'Game Development'
So i implemented a Quad tree in Java for my 2D game and everything works fine except for when i run my collision detection algorithm, which checks if a object has hit another object and which side it hit.My problem is 80% of the time the collision algorithm works but sometimes the objects just go… >>> More

Related posts about language-agnostic

Are there any language agnostic unit testing frameworks?

as seen on Programmers - Search for 'Programmers'
I have always been skeptical of rewriting working code - porting code is no exception to this. However, with the advent of TDD and automated testing it is much more reasonable to rewrite and refactor code. Does anyone know if there is a TDD tool that can be used for porting old code? Ideally you… >>> More
Language Agnostic Basic Programming Question

as seen on Stack Overflow - Search for 'Stack Overflow'
This is very basic question from programming point of view but as I am in learning phase, I thought I would better ask this question rather than having a misunderstanding or narrow knowledge about the topic. So do excuse me if somehow I mess it up. Question: Let's say I have class A,B,C and D… >>> More
Language-agnostic term for typed things that need memory

as seen on Stack Overflow - Search for 'Stack Overflow'
Is there an accepted general term that subsumes the concepts of variables, class instances and arrays? Basically "any typed thing that needs memory". In C++, such a thing is called an object, but I'm looking for a more language-agnostic term. § 1.8 The C++ object model 1 The constructs in… >>> More
Deprecated Methods in Code Base

as seen on Programmers - Search for 'Programmers'
A lot of the code I've been working on recently, both professionally (read: at work) and in other spheres (read: at home, for friends/family/etc, or NOT FOR WORK), has been worked on, redesigned and re-implemented several times - where possible/required. This has been in an effort to make things smaller… >>> More
Coding Competition, language agnostic guidelines?

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi there: I might be doing a coding competition soon, I was wondering if anyone made one and what where the guidelines/ process. I'd like to make the competition appealing to all devs, and I m trying to come up with ideas as to how. the scenario is: There is an event running and we(of the coding… >>> More