How to calculate this string-dissimilarity function efficiently?
- by ybungalobill
Hello,
I was looking for a string metric that has the property that moving large blocks around in a string doesn't affect the distance much. So "helloworld" is close to "worldhello". Obviously, Levenshtein distance and longest common subsequence don't fulfill this requirement. Using the Jaccard distance on the sets of n-grams gives good results, but it has other drawbacks (it's a pseudometric, and a higher n results in a higher penalty for changing a single character).
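For reference, this is the n-gram Jaccard approach I mean (a minimal sketch; the helper names and the default n=2 are my own choices):

```python
def ngrams(s, n=2):
    # Set of all contiguous substrings of length n.
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard_distance(a, b, n=2):
    # 1 - |intersection| / |union| over the two n-gram sets.
    A, B = ngrams(a, n), ngrams(b, n)
    if not A and not B:
        return 0.0
    return 1.0 - len(A & B) / len(A | B)
```

For example, `jaccard_distance("helloworld", "worldhello")` is small (0.2 with bigrams), since the two strings share almost all of their bigrams.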
[original research]
As I thought about it, what I'm looking for is a function f(A,B) such that f(A,B)+1 equals the minimum number of blocks that one has to divide A into (A1 ... An) so that applying a permutation to the blocks yields B:
f("hello", "hello") = 0
f("helloworld", "worldhello") = 1 // hello world -> world hello
f("abba", "baba") = 2 // ab b a -> b ab a
f("computer", "copmuter") = 3 // co m p uter -> co p m uter
This can be extended to A and B that aren't necessarily permutations of each other: any additional character that can't be matched is counted as one additional block.
f("computer", "combuter") = 3 // com uter -> com uter, unmatched: p and b.
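To make the block interpretation concrete, here is a tiny helper (hypothetical, not part of any definition above) that applies a permutation to a block decomposition; f is then one less than the number of blocks used:

```python
def apply_block_permutation(blocks, perm):
    # Reassemble the blocks in the order given by perm.
    return "".join(blocks[i] for i in perm)

# The examples above, verified: f = len(blocks) - 1.
assert apply_block_permutation(["hello", "world"], [1, 0]) == "worldhello"     # f = 1
assert apply_block_permutation(["ab", "b", "a"], [1, 0, 2]) == "baba"          # f = 2
assert apply_block_permutation(["co", "m", "p", "uter"], [0, 2, 1, 3]) == "copmuter"  # f = 3
```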
Observing that instead of counting blocks we can count the pairs of adjacent indices that the mapping keeps adjacent, we can write f(A,B) formally as:
f(A,B) = min { C(P) | P : {1..|A|} → {1..|B|} is a partial injective map, ∀i∈dom(P): A[i] = B[P(i)] }
C(P) = |A| + |B| - |dom(P)| - |{ i | i,i+1∈dom(P) and P(i)+1 = P(i+1) }| - 1
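The definition itself can at least be evaluated by brute force over all partial injective matchings, which is what I've been doing to check the examples (exponential time, tiny strings only; the function name `f` matches the notation above):

```python
def f(a, b):
    # Exponential brute force over partial injective matchings P
    # from indices of a to indices of b with a[i] == b[P(i)].
    n, m = len(a), len(b)
    best = n + m - 1  # cost of the empty matching

    def rec(i, mapping, used):
        nonlocal best
        if i == n:
            # Adjacent pairs kept adjacent by P.
            kept = sum(1 for j in mapping
                       if j + 1 in mapping and mapping[j] + 1 == mapping[j + 1])
            best = min(best, n + m - len(mapping) - kept - 1)
            return
        rec(i + 1, mapping, used)          # leave position i unmatched
        for j in range(m):
            if j not in used and a[i] == b[j]:
                mapping[i] = j
                used.add(j)
                rec(i + 1, mapping, used)
                del mapping[i]
                used.remove(j)

    rec(0, {}, set())
    return best
```

This reproduces all the examples above, e.g. `f("helloworld", "worldhello") == 1` and `f("computer", "combuter") == 3`, but it's obviously useless beyond toy inputs.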
The problem is... guess what...
... that I'm not able to calculate this in polynomial time.
Can someone suggest a way to do this efficiently? Or perhaps point me to an already known metric that exhibits similar properties?