I am using the MinHash algorithm to find similar images in a set of images.
I came across this post, "How can I recognize slightly modified images?", which pointed me to the MinHash algorithm.
Being a bit mathematically challenged, I am using a C# implementation from this blog post, "Set Similarity and Min Hash".
But while trying to use the implementation, I have run into two problems.
What value should I set the universe size to?
When I pass an image's byte array to a HashSet, the set keeps only the distinct byte values, so the comparison is between values in the range 0 to 255.
What is this universe in MinHash?
And what can I do to improve the C# MinHash implementation?
Since a HashSet<byte> can hold at most 256 distinct values, the similarity value always comes out to 1.
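To illustrate the second problem in isolation, here is a minimal sketch that does not use the blog's MinHash class at all. The names `ByteSetDemo`, `MakeBuffer`, and `Jaccard` are made up for this example, and the two buffers are synthetic stand-ins for image files. It computes the exact Jaccard similarity (the quantity MinHash estimates) on the byte sets, and shows that two different buffers collapse to the same 256-element set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ByteSetDemo
{
    // Hypothetical helper: builds a buffer whose i-th byte is (i * step + offset) mod 256.
    public static byte[] MakeBuffer(int length, int step, int offset)
    {
        return Enumerable.Range(0, length)
                         .Select(i => (byte)((i * step + offset) % 256))
                         .ToArray();
    }

    // Exact Jaccard similarity: |A intersect B| / |A union B|.
    public static double Jaccard(HashSet<byte> a, HashSet<byte> b)
    {
        int intersection = a.Count(x => b.Contains(x));
        int union = a.Count + b.Count - intersection;
        return union == 0 ? 1.0 : (double)intersection / union;
    }

    public static void Main()
    {
        // Two buffers with different contents (they already differ at index 0).
        byte[] data1 = MakeBuffer(1000, 1, 0);
        byte[] data2 = MakeBuffer(1000, 7, 3);

        // Converting to HashSet<byte> discards order and repetition.
        var set1 = new HashSet<byte>(data1);
        var set2 = new HashSet<byte>(data2);

        Console.WriteLine("buffers equal: {0}", data1.SequenceEqual(data2)); // False
        Console.WriteLine("distinct values: {0} and {1}", set1.Count, set2.Count); // 256 and 256
        Console.WriteLine("jaccard = {0}", Jaccard(set1, set2)); // 1
    }
}
```

Any reasonably large file is likely to contain every byte value at least once, which would explain why my real images behave the same way as these synthetic buffers.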
Here is the source that uses the C# MinHash implementation from Set Similarity and Min Hash:
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            var imageSet1 = GetImageByte(@".\Images\01.JPG");
            var imageSet2 = GetImageByte(@".\Images\02.TIF");

            // MinHash is the class from the blog post; its constructor takes the universe size.
            //var app = new MinHash(256);
            var app = new MinHash(Math.Min(imageSet1.Count, imageSet2.Count));
            double imageSimilarity = app.Similarity(imageSet1, imageSet2);
            Console.WriteLine("similarity = {0}", imageSimilarity);
        }

        // Reads the whole file and returns the set of distinct byte values it contains.
        private static HashSet<byte> GetImageByte(string imagePath)
        {
            using (var fs = new FileStream(imagePath, FileMode.Open, FileAccess.Read))
            using (var br = new BinaryReader(fs))
            {
                //List<int> bytes = br.ReadBytes((int)fs.Length).Cast<int>().ToList();
                var bytes = new List<byte>(br.ReadBytes((int) fs.Length).ToArray());
                // The HashSet drops duplicates, so at most 256 values survive.
                return new HashSet<byte>(bytes);
            }
        }
    }