Writing a spell checker similar to "did you mean"
- by user888734
I'm hoping to write a spellchecker for search queries in a web application - not unlike Google's "Did you mean?" The algorithm will be loosely based on this: http://catalog.ldc.upenn.edu/LDC2006T13
In short, it generates correction candidates and scores them on how often they appear (along with adjacent words in the search query) in an enormous dataset of known n-grams - Google Web 1T - which contains well over 1 billion 5-grams.
I'm not using the Web 1T dataset, but building my n-gram sets from my own documents - about 200k docs, and I'm estimating tens or hundreds of millions of n-grams will be generated.
This kind of process is pushing the limits of my understanding of basic computing performance - can I simply load my n-grams into memory in a hashtable or dictionary when the app starts? Is the only limiting factor the amount of memory on the machine?
Or am I barking up the wrong tree? Perhaps putting all my n-grams in a graph database with some sort of tree query optimisation? Could that ever be fast enough?