Shortest Common Superstring: find shortest string that contains all given string fragments
- by occulus
Given some string fragments, I would like to find the shortest possible single string ("output string") that contains all the fragments. Fragments can overlap each other in the output string.
Example:
For the string fragments:
BCDA
AGF
ABC
The following output string contains all fragments, and was made by naive appending:
BCDAAGFABC
However this output string is better (shorter), as it employs overlaps:
ABCDAGF
^
ABC
^
BCDA
^
AGF
I'm looking for algorithms for this problem. It's not absolutely important to find the strictly shortest output string, but the shorter the better. I'm looking for an algorithm better than the obvious naive one that would try appending all permutations of the input fragments and removing overlaps (which would appear to be NP-Complete).
I've started work on a solution and it's proving quite interesting; I'd like to see what other people might come up with. I'll add my work-in-progress to this question in a while.