How to calculate this string-dissimilarity function efficiently?

Posted by ybungalobill on Stack Overflow See other posts from Stack Overflow or by ybungalobill
Published on 2011-02-06T15:23:39Z Indexed on 2011/02/06 15:25 UTC
Read the original article Hit count: 285

Hello,

I was looking for a string metric that have the property that moving around large blocks in a string won't affect the distance so much. So "helloworld" is close to "worldhello". Obviously Levenshtein distance and Longest common subsequence don't fulfill this requirement. Using Jaccard distance on the set of n-grams gives good results but has other drawbacks (it's a pseudometric and higher n results in higher penalty for changing single character).

[original research]

As I thought about it, what I'm looking for is a function f(A,B) such that f(A,B)+1 equals the minimum number of blocks that one have to divide A into (A1 ... An), apply a permutation on the blocks and get B:

f("hello", "hello") = 0
f("helloworld", "worldhello") = 1   // hello world -> world hello
f("abba", "baba") = 2               // ab b a -> b ab a
f("computer", "copmuter") = 3       // co m p uter -> co p m uter

This can be extended for A and B that aren't necessarily permutations of each other: any additional character that can't be matched is considered as one additional block.

f("computer", "combuter") = 3       // com uter -> com uter, unmatched: p and b.

Observing that instead of counting blocks we can count the number of pairs of indices that are taken apart by a permutation, we can write f(A,B) formally as:

f(A,B) = min { C(P) | P:|A|?|B|, P is bijective, ?i?dom(P) A[P(i)]=B[P(i)] }
C(P) = |A| + |B| - |dom(P)| - |{ i | i,i+1?dom(P) and P(i)+1=P(i+1) }| - 1

The problem is... guess what...

... that I'm not able to calculate this in polynomial time.

Can someone suggest a way to do this efficiently? Or perhaps point me to already known metric that exhibits similar properties?

© Stack Overflow or respective owner

Related posts about algorithm

Related posts about metrics