Optimizing near-duplicate value search
- by GApple
I'm trying to find near-duplicate values in a set of fields so that an administrator can clean them up.
There are two criteria that I am matching on:
1. One string is wholly contained within the other, and is at least 1/4 of its length.
2. The strings have an edit distance less than 5% of the total length of the two strings.
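For reference, the two criteria could be expressed as a single pairwise check roughly like the sketch below. This is only an illustration, not the code from this post: the function name `isNearDuplicate` is hypothetical, "edit distance" is assumed to mean Levenshtein distance, and lengths are assumed to be byte lengths (so multi-byte text would need different handling).

```php
<?php
// Hypothetical helper illustrating the two criteria; not the original poster's code.
function isNearDuplicate(string $a, string $b): bool
{
    // Order the pair so that $short is never longer than $long.
    [$short, $long] = strlen($a) <= strlen($b) ? [$a, $b] : [$b, $a];

    // Criterion 1: the shorter string appears inside the longer one
    // and is at least 1/4 of the longer string's length.
    if ($short !== ''
        && strpos($long, $short) !== false
        && strlen($short) * 4 >= strlen($long)) {
        return true;
    }

    // Criterion 2: edit distance below 5% of the combined length.
    // Assumption: Levenshtein distance. Before PHP 8.0, levenshtein()
    // returns -1 for arguments longer than 255 characters, so longer
    // values would need a different implementation there.
    $total = strlen($a) + strlen($b);
    return levenshtein($a, $b) < 0.05 * $total;
}
```

With these thresholds, "Main Street" and "123 Main Street" match on the first criterion, while "a" and "abcdefghij" do not match at all, because the contained string is shorter than a quarter of the other.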
The Pseudo-PHP…