finding long repeated substrings in a massive string

Posted by Will on Stack Overflow See other posts from Stack Overflow or by Will
Published on 2008-12-29T21:56:55Z Indexed on 2010/05/07 11:48 UTC
Read the original article Hit count: 246

Filed under:

Performance

|

algorithm

|

search

|

string

I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for.

I have a really really long string (hundreds of megabytes). I have about 1 GB of RAM.

This is why building a suffix trie with counting data is too inefficient space-wise to work for me. To quote Wikipedia's Suffix tree:

storing a string's suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

And that was wikipedia's comments on the tree, not trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )

© Stack Overflow or respective owner

Related posts about Performance

Improving VPN performance - stronger encryption = more performance?

as seen on Server Fault - Search for 'Server Fault'
I have a site-to-site VPN set up with two SonicWall's (a TZ170 and a Pro1260). It was suggested to me that turning off encryption (so the VPN is tunneling only) would improve performance. (I'm not concerned with security, because the VPN is running over a trusted line.) Using FTP and HTTP transfers… >>> More
Inaccurate performance counter timer values in Windows Performance Monitor

as seen on Stack Overflow - Search for 'Stack Overflow'
I am implementing instrumentation within an application and have encountered an issue where the value that is displayed in Windows Performance Monitor from a PerformanceCounter is incongruent with the value that is recorded. I am using a Stopwatch to record the duration of a method execution, then… >>> More
Excel-based Performance Reviews transformed into Web Application for Performance Management

as seen on Geeks with Blogs - Search for 'Geeks with Blogs'
HR TMS provides enterprise talent management solutions for healthcare, retail and corporate customers, focusing on performance management, compensation management and succession planning. As the competency of nurses and other healthcare workers is critical, the government, via the Joint Commission… >>> More
How to save a perfmon Performance Counter as a textfile (Reliability and Performance Monitor Version

as seen on Server Fault - Search for 'Server Fault'
Now the file gets saved as blg, but I would like a txt versin to import in Excel. >>> More
SQLAuthority News – A Successful Performance Tuning Seminar at Pune – Dec 4-5, 2010

as seen on SQL Authority - Search for 'SQL Authority'
This is report to my third of very successful seminar event on SQL Server Performance Tuning. SQL Server Performance Tuning Seminar in Colombo was oversubscribed with total of 35 attendees. You can read the details over here SQLAuthority News – SQL Server Performance Optimizations Seminar – Grand… >>> More

Related posts about algorithm

Jpeg Algorithm vs BMP Algorithm?

as seen on Super User - Search for 'Super User'
I'm just wonder, what the differences are between creating a BMP file algorithm and JPG file algorithm ? If you know the others images' format algorithm, please post them. Thanks. >>> More
word disambiguation algorithm (Lesk algorithm)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hii.. Can anybody help me to find an algorithm in Java code to find synonyms of a search word based on the context and I want to implement the algorithm with WordNet database. For example, "I am running a Java program". From the context, I want to find the synonyms for the word "running", but the… >>> More
Search algorithm (with a sort algorithm already implemented)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello, Im doing a Java application and Im facing some doubts in which concerns performance. I have a PriorityQueue which guarantees me the element removed is the one with greater priority. That PriorityQueue has instances of class Event (which implements Comparable interface). Each Event is associated… >>> More
Is there any algorithm for finding LINES by PIXEL COLORS on picture?

as seen on Stack Overflow - Search for 'Stack Overflow'
So I have Image like this I want to get something like this (I hevent drawn all lines I want but I hope you can get my idea) I need algorithm for finding all straight lines on it by just reading colors of pixels. No hard math, no Haar, no Hough. Some algorithm which would be based on points… >>> More
collsion issues with quadtree [on hold]

as seen on Game Development - Search for 'Game Development'
So i implemented a Quad tree in Java for my 2D game and everything works fine except for when i run my collision detection algorithm, which checks if a object has hit another object and which side it hit.My problem is 80% of the time the collision algorithm works but sometimes the objects just go… >>> More