Autocomplete server-side implementation
- by toluju
What is a fast and efficient way to implement the server-side component for an autocomplete feature in an html input box?
I am writing a service to autocomplete user queries in our web interface's main search box, and the completions are displayed in an ajax-powered dropdown. The data we are running queries against is simply a large table of concepts our system knows about, which matches roughly with the set of wikipedia page titles. For this service obviously speed is of utmost importance, as responsiveness of the web page is important to the user experience.
The current implementation simply loads all concepts into memory in a sorted set, and performs a simple log(n) lookup on a user keystroke. The tailset is then used to provide additional matches beyond the closest match. The problem with this solution is that it does not scale. It currently is running up against the VM heap space limit (I've set -Xmx2g, which is about the most we can push on our 32 bit machines), and this prevents us from expanding our concept table or adding more functionality. Switching to 64-bit VMs on machines with more memory isn't an immediate option.
I've been hesitant to start working on a disk-based solution as I am concerned that disk seek time will kill performance. Are there possible solutions that will let me scale better, either entirely in memory or with some fast disk-backed implementations?
Edits:
@Gandalf: For our use case it is important the the autocompletion is comprehensive and isn't just extra help for the user. As for what we are completing, it is a list of concept-type pairs. For example, possible entries are [("Microsoft", "Software Company"), ("Jeff Atwood", "Programmer"), ("StackOverflow.com", "Website")]. We are using Lucene for the full search once a user selects an item from the autocomplete list, but I am not yet sure Lucene would work well for the autocomplete itself.
@Glen: No databases are being used here. When I'm talking about a table I just mean the structured representation of my data.
@Jason Day: My original implementation to this problem was to use a Trie, but the memory bloat with that was actually worse than the sorted set due to needing a large number of object references. I'll read on the ternary search trees to see if it could be of use.