Auto-Suggest via ‘Trie’ (Pre-fix Tree)
Posted
by Strenium
on Geeks with Blogs
See other posts from Geeks with Blogs
or by Strenium
Published on Thu, 23 Aug 2012 22:19:25 GMT
Indexed on
2012/08/27
21:40 UTC
Read the original article
Hit count: 750
Auto-Suggest (Auto-Complete) “thing” has been around for a few years. Here’s my little snippet on the subject.
For one of my projects, I had to deal with a non-trivial set of items to be pulled via auto-suggest used by multiple concurrent users. Simple, dumb iteration through a list in local cache or back-end access didn’t quite cut it.
Enter a nifty little structure, perfectly suited for storing and matching verbal data: “Trie” (http://tinyurl.com/db56g) also known as a Pre-fix Tree:
“Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.”
This is a very scalable, performing structure. Though, as usual, something ‘fast’ comes at a cost of ‘size’; fortunately RAM is more plentiful today so I can live with that.
I won’t bore you with the detailed algorithmic performance here - Google can do a better job of such.
So, here’s C# implementation of all this. Let’s start with individual node:
- /// <summary>
- /// Contains datum of a single trie node.
- /// </summary>
- public class AutoSuggestTrieNode
- {
- public char Value { get; set; }
- /// <summary>
- /// Gets a value indicating whether this instance is leaf node.
- /// </summary>
- /// <value>
- /// <c>true</c> if this instance is leaf node; otherwise, a prefix node <c>false</c>.
- /// </value>
- public bool IsLeafNode { get; private set; }
- public List<AutoSuggestTrieNode> DescendantNodes { get; private set; }
- /// <summary>
- /// Initializes a new instance of the <see cref="AutoSuggestTrieNode"/> class.
- /// </summary>
- /// <param name="value">The phonetic value.</param>
- /// <param name="isLeafNode">if set to <c>true</c> [is leaf node].</param>
- public AutoSuggestTrieNode(char value = ' ', bool isLeafNode = false)
- {
- Value = value;
- IsLeafNode = isLeafNode;
- DescendantNodes = new List<AutoSuggestTrieNode>();
- }
- /// <summary>
- /// Gets the descendants of the pre-fix node, if any.
- /// </summary>
- /// <param name="descendantValue">The descendant value.</param>
- /// <returns></returns>
- public AutoSuggestTrieNode GetDescendant(char descendantValue)
- {
- return DescendantNodes.FirstOrDefault(descendant => descendant.Value == descendantValue);
- }
- }
Quite self-explanatory, imho. A node is either a “Pre-fix” or a “Leaf” node. “Leaf” contains the full “word”, while the “Pre-fix” nodes act as indices used for matching the results.
Ok, now the Trie:
- /// <summary>
- /// Contains structure and functionality of an AutoSuggest Trie (Pre-fix Tree)
- /// </summary>
- public class AutoSuggestTrie
- {
- private readonly AutoSuggestTrieNode _root = new AutoSuggestTrieNode();
- /// <summary>
- /// Adds the word to the trie by breaking it up to pre-fix nodes + leaf node.
- /// </summary>
- /// <param name="word">Phonetic value.</param>
- public void AddWord(string word)
- {
- var currentNode = _root;
- word = word.Trim().ToLower();
- for (int i = 0; i < word.Length; i++)
- {
- var child = currentNode.GetDescendant(word[i]);
- if (child == null) /* this character hasn't yet been indexed in the trie */
- {
- var newNode = new AutoSuggestTrieNode(word[i], word.Count() - 1 == i);
- currentNode.DescendantNodes.Add(newNode);
- currentNode = newNode;
- }
- else
- currentNode = child; /* this character is already indexed, move down the trie */
- }
- }
- /// <summary>
- /// Gets the suggested matches.
- /// </summary>
- /// <param name="word">The phonetic search value.</param>
- /// <returns></returns>
- public List<string> GetSuggestedMatches(string word)
- {
- var currentNode = _root;
- word = word.Trim().ToLower();
- var indexedNodesValues = new StringBuilder();
- var resultBag = new ConcurrentBag<string>();
- for (int i = 0; i < word.Trim().Length; i++) /* traverse the trie collecting closest indexed parent (parent can't be leaf, obviously) */
- {
- var child = currentNode.GetDescendant(word[i]);
- if (child == null || word.Count() - 1 == i)
- break; /* done looking, the rest of the characters aren't indexed in the trie */
- indexedNodesValues.Append(word[i]);
- currentNode = child;
- }
- Action<AutoSuggestTrieNode, string> collectAllMatches = null;
- collectAllMatches = (node, aggregatedValue) => /* traverse the trie collecting matching leafNodes (i.e. "full words") */
- {
- if (node.IsLeafNode) /* full word */
- resultBag.Add(aggregatedValue); /* thread-safe write */
- Parallel.ForEach(node.DescendantNodes, descendandNode => /* asynchronous recursive traversal */
- {
- collectAllMatches(descendandNode, String.Format("{0}{1}", aggregatedValue, descendandNode.Value));
- });
- };
- collectAllMatches(currentNode, indexedNodesValues.ToString());
- return resultBag.OrderBy(o => o).ToList();
- }
- /// <summary>
- /// Gets the total words (leafs) in the trie. Recursive traversal.
- /// </summary>
- public int TotalWords
- {
- get
- {
- int runningCount = 0;
- Action<AutoSuggestTrieNode> traverseAllDecendants = null;
- traverseAllDecendants = n => { runningCount += n.DescendantNodes.Count(o => o.IsLeafNode); n.DescendantNodes.ForEach(traverseAllDecendants); };
- traverseAllDecendants(this._root);
- return runningCount;
- }
- }
- }
Matching operations and Inserts involve traversing the nodes before the right “spot” is found. Inserts need be synchronous since ordering of data matters here.
However, matching can be done in parallel traversal using recursion (line 64).
Here’s sample usage:
- [TestMethod]
- public void AutoSuggestTest()
- {
- var autoSuggestCache = new AutoSuggestTrie();
- var testInput = @"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer nec odio. Praesent libero.
- Sed cursus ante dapibus diam. Sed nisi. Nulla quis sem at nibh elementum imperdiet. Duis sagittis ipsum. Praesent mauris.
- Fusce nec tellus sed augue semper porta. Mauris massa. Vestibulum lacinia arcu eget nulla. Class aptent taciti sociosqu ad
- litora torquent per conubia nostra, per inceptos himenaeos. Curabitur sodales ligula in libero. Sed dignissim lacinia nunc.
- Curabitur tortor. Pellentesque nibh. Aenean quam. In scelerisque sem at dolor. Maecenas mattis. Sed convallis tristique sem.
- Proin ut ligula vel nunc egestas porttitor. Morbi lectus risus, iaculis vel, suscipit quis, luctus non, massa. Fusce ac
- turpis quis ligula lacinia aliquet. Mauris ipsum. Nulla metus metus, ullamcorper vel, tincidunt sed, euismod in, nibh. Quisque
- volutpat condimentum velit. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Nam
- nec ante. Sed lacinia, urna non tincidunt mattis, tortor neque adipiscing diam, a cursus ipsum ante quis turpis. Nulla
- facilisi. Ut fringilla. Suspendisse potenti. Nunc feugiat mi a tellus consequat imperdiet. Vestibulum sapien. Proin quam. Etiam
- ultrices. Suspendisse in justo eu magna luctus suscipit. Sed lectus. Integer euismod lacus luctus magna. Quisque cursus, metus
- vitae pharetra auctor, sem massa mattis sem, at interdum magna augue eget diam. Vestibulum ante ipsum primis in faucibus orci
- luctus et ultrices posuere cubilia Curae; Morbi lacinia molestie dui. Praesent blandit dolor. Sed non quam. In vel mi sit amet
- augue congue elementum. Morbi in ipsum sit amet pede facilisis laoreet. Donec lacus nunc, viverra nec.";
- testInput.Split(' ').ToList().ForEach(word => autoSuggestCache.AddWord(word));
- var testMatches = autoSuggestCache.GetSuggestedMatches("le");
- }
..and the result:
That’s it!
© Geeks with Blogs or respective owner