Optimize Duplicate Detection

Posted by Dave Jarvis on Stack Overflow See other posts from Stack Overflow or by Dave Jarvis
Published on 2012-10-23T01:37:35Z Indexed on 2012/10/23 5:04 UTC
Read the original article Hit count: 165

Background

This is an optimization problem. Oracle Forms XML files have elements such as:

<Trigger TriggerName="name" TriggerText="SELECT * FROM DUAL" ... />

Where the TriggerText is arbitrary SQL code. Each SQL statement has been extracted into uniquely named files such as:

sql/module=DIAL_ACCESS+trigger=KEY-LISTVAL+filename=d_access.fmb.sql     
sql/module=REP_PAT_SEEN+trigger=KEY-LISTVAL+filename=rep_pat_seen.fmb.sql 

I wrote a script to generate a list of exact duplicates using a brute force approach.

Problem

There are 37,497 files to compare against each other; it takes 8 minutes to compare one file against all the others. Logically, if A = B and A = C, then there is no need to check if B = C. So the problem is: how do you eliminate the redundant comparisons?

The script will complete in approximately 208 days.

Script Source Code

The comparison script is as follows:

#!/bin/bash

echo Loading directory ...

for i in $(find sql/ -type f -name \*.sql); do
        echo Comparing $i ...

        for j in $(find sql/ -type f -name \*.sql); do
                if [ "$i" = "$j" ]; then
                        continue;
                fi

                # Case insensitive compare, ignore spaces
                diff -IEbwBaq $i $j > /dev/null

                # 0 = no difference (i.e., duplicate code)
                if [ $? = 0 ]; then
                        echo $i :: $j >> clones.txt
                fi
        done
done

Question

How would you optimize the script so that checking for cloned code is a few orders of magnitude faster?

System Constraints

Using a quad-core CPU with an SSD; trying to avoid using cloud services if possible. The system is a Windows-based machine with Cygwin installed -- algorithms or solutions in other languages are welcome.

Thank you!

© Stack Overflow or respective owner

Related posts about java

Related posts about perl