How to quickly search through a very large list of strings / records on a database
- by Giorgio
I have the following problem: I have a database containing more than 2 million records.
Each record has a string field X and I want to display a list of records for which field X contains a certain string. Each record is about 500 bytes in size.
To make it more concrete: in the GUI of my application I have a text field where I can enter a string. Above the text field I have a table displaying the (first N, e.g. 100) records that match the string in the text field. When I type or delete one character in the text field, the table content must be updated on the fly.
I wonder if there is an efficient way of doing this using appropriate index structures and / or caching. As explained above, I only want to display the first N items that match the query. Therefore, for N small enough, it should not be a big issue loading the matching items from the database. Besides, caching items in main memory can make retrieval faster.
I think the main problem is how to find the matching items quickly, given the pattern string. Can I rely on some DBMS facilities, or do I have to build some in-memory index myself? Any ideas?
EDIT
I have run a first experiment. I have split the records into different text files (at most 200 records per file) and put the files in different directories (I used the content of one data field to determine the directory tree). I end up with about 50000 files in about 40000 directories. I have then run Lucene to index the files. Searching for a string with the Lucene demo program is pretty fast. Splitting and indexing took a few minutes: this is totally acceptable for me because it is a static data set that I want to query.
The next step is to integrate Lucene in the main program and use the hits returned by Lucene to load the relevant records into main memory.