finding long repeated substrings in a massive string

Posted by Will on Stack Overflow See other posts from Stack Overflow or by Will
Published on 2008-12-29T21:56:55Z Indexed on 2010/05/07 11:48 UTC
Read the original article Hit count: 186

Filed under:
|
|
|

I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for.

I have a really really long string (hundreds of megabytes). I have about 1 GB of RAM.

This is why building a suffix trie with counting data is too inefficient space-wise to work for me. To quote Wikipedia's Suffix tree:

storing a string's suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

And that was wikipedia's comments on the tree, not trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )

© Stack Overflow or respective owner

Related posts about Performance

Related posts about algorithm