Good way to identify similar images?
- by Nick
I've developed a simple and fast algorithm in PHP to compare images for similarity.
Its fast (~40 per second for 800x600 images) to hash and a unoptimised search algorithm can go through 3,000 images in 22 mins comparing each one against the others (3/sec).
The basic overview is you get a image, rescale it to 8x8 and then convert those pixels for HSV. The Hue, Saturation and Value are then truncated to 4 bits and it becomes one big hex string.
Comparing images basically walks along two strings, and then adds the differences it finds. If the total number is below 64 then its the same image. Different images are usually around 600 - 800. Below 20 and extremely similar.
Are there any improvements upon this model I can use?
I havent looked at how relevant the different components (hue, saturation and value) are to the comparison. Hue is probably quite important but the others?
To speed up searches I could probably split the 4 bits from each part in half, and put the most significant bits first so if they fail the check then the lsb doesnt need to be checked at all. I dont know a efficient way to store bits like that yet still allow them to be searched and compared easily.
I've been using a dataset of 3,000 photos (mostly unique) and there havent been any false positives. Its completely immune to resizes and fairly resistant to brightness and contrast changes.