Search Results

Search found 1 results on 1 pages for 'gapple'.

Page 1/1 | 1 

  • Optimizing near-duplicate value search

    - by GApple
    I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up. There are two criteria that I am matching on One string is wholly contained within the other, and is at least 1/4 of its length The strings have an edit distance less than 5% of the total length of the two strings The Pseudo-PHP code: foreach($values as $value){ foreach($values as $match){ if( ( $value['length'] < $match['length'] && $value['length'] * 4 > $match['length'] && stripos($match['value'], $value['value']) !== false ) || ( $match['length'] < $value['length'] && $match['length'] * 4 > $value['length'] && stripos($value['value'], $match['value']) !== false ) || ( abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length']) && 0 < ($match['changes'] = levenshtein($value['value'], $match['value'])) && $match['changes'] * 20 <= ($value['length'] + $match['length']) ) ){ $matches[] = &$match; } } } I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit. However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays. Some properties of a few sets of values being operated on Total | Strings | # of matches per string | | Strings | With Matches | Average | Median | Max | Time (s) | --------+--------------+---------+--------+------+----------+ 844 | 413 | 1.8 | 1 | 58 | 140 | 593 | 156 | 1.2 | 1 | 5 | 62 | 272 | 168 | 3.2 | 2 | 26 | 10 | 157 | 47 | 1.5 | 1 | 4 | 3.2 | 106 | 48 | 1.8 | 1 | 8 | 1.3 | 62 | 47 | 2.9 | 2 | 16 | 0.4 | Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?

    Read the article

1