On Disk Substring index

Posted by emeryc on Stack Overflow See other posts from Stack Overflow or by emeryc
Published on 2008-09-09T23:48:26Z Indexed on 2010/05/07 17:18 UTC
Read the original article Hit count: 232

Filed under:
|
|

I have a file (fasta file to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find the location within the original fasta file.

This would be easy to do in many cases, using a Trie or substring array, unfortunately the strings I need to index are 800+ MBs which means that doing them in memory in unacceptable, so I'm looking for a reasonable way to create this index on disk, with minimal memory usage.

(edit for clarification)

I am only interested in the headers of proteins, so for the largest database I'm interested in, this is about 800 MBs of text.

I would like to be able to find an exact substring within O(N) time based on the input string. This must be useable on 32 bit machines as it will be shipped to random people, who are not expected to have 64 bit machines.

I want to be able to index against any word break within a line, to the end of the line (though lines can be several MBs long).

Hopefully this clarifies what is needed and why the current solutions given are not illuminating.

I should also add that this needs to be done from within java, and must be done on client computers on various operating systems, so I can't use any OS Specific solution, and it must be a programatic solution.

© Stack Overflow or respective owner

Related posts about indexing

Related posts about on-disk