Finding the lowest average Hamming distance when the order of the strings matter
- by user1049697
I have a sequence of binary strings that I want to find a match for among a set of longer sequences of binary strings. A match means that the compared sequence gives the lowest average Hamming distance when all elements in the shorter sequence have been matched against a sequence in one of the longer sets.
Let me try to explain with an example. I have a set of video frames that have been hashed using a perceptual hashing algorithm so that the video frames that look the same has roughly the same hash. I want to match a short video clip against a set of longer videos, to see if the clip is contained in one of these. This means that I need to find out where the sequence of the hashed frames in the short video has the lowest average Hamming distance when compared with the long videos.
The short video is the sub strings Sub1, Sub2 and Sub3, and I want to match them against the hashes of the long videos in Src. The clue here is that the strings need to match in the specific order that they are given in, e.g. that Sub1 always has to match the element before Sub2, and Sub2 always has to match the element before Sub3. In this example it would map thusly: Sub1-Src3, Sub2-Src4 and Sub3-Src5.
So the question is this: is there an algorithm for finding the lowest average Hamming distance when the order of the elements compared matter? The naïve approach to compare the substring sequence to every source string won't cut it of course, so I need something that preferably can match a (much) shorter sub string to a set of million of elements. I have looked at MVP-trees, BK-trees and similar, but everything seems to only take into account one binary string and not a sequence of them.
Sub1: 100111011111011101
Sub2: 110111000010010100
Sub3: 111111010110101101
Src1: 001011010001010110
Src2: 010111101000111001
Src3: 101111001110011101
Src4: 010111100011010101
Src5: 001111010110111101
Src6: 101011111111010101
I have added a calculation of the examples below. (The Hamming distances aren't correct, but it doesn't matter)
**Run 1.**
dist(Sub1, Src1) = 8
dist(Sub2, Src2) = 10
dist(Sub3, Src3) = 12
average = 10
**Run 2.**
dist(Sub1, Src2) = 10
dist(Sub2, Src3) = 12
dist(Sub3, Src4) = 10
average = 11
**Run 3.**
dist(Sub1, Src3) = 7
dist(Sub2, Src4) = 6
dist(Sub3, Src5) = 10
average = 8
**Run 4.**
dist(Sub1, Src3) = 10
dist(Sub2, Src4) = 4
dist(Sub3, Src5) = 2
average = 5
So the winner here is sequence 4 with an average distance of 5.