Efficient Trie implementation for unicode strings

Posted by U Mad on Programmers
Published on 2012-07-05T11:25:42Z

I have been looking for an efficient String trie implementation. Mostly I have found code like this:

Reference implementation in Java (per Wikipedia)
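That style of implementation generally looks something like the following (a simplified sketch of the array-per-node approach; class and field names are mine, not from the Wikipedia article):

```java
// Sketch of the classic array-per-node trie: every node allocates a
// fixed 256-slot child array, indexed directly by character value.
class ArrayTrie {
    private static final int R = 256;      // extended-ASCII alphabet size

    private static class Node {
        Node[] next = new Node[R];         // 256 references per node
        boolean isWord;                    // true if a key ends here
    }

    private final Node root = new Node();

    void insert(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i);         // silently breaks for chars >= 256
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isWord = true;
    }

    boolean contains(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i);
            if (c >= R || x.next[c] == null) return false;
            x = x.next[c];
        }
        return x.isWord;
    }
}
```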

I dislike these implementations for two main reasons:

  1. They support only 256 characters (extended ASCII). I need to cover alphabets like Cyrillic.
  2. They are extremely memory inefficient.

Each node contains an array of 256 references, which is 2048 bytes on a 64-bit JVM with 8-byte references (half that with compressed oops). Each of these nodes can have up to 256 subnodes, each carrying its own 2 KB array. So a full trie for every two-character ASCII string already needs over half a megabyte just for the arrays in nodes with children (257 nodes × 2 KB). Three-character strings? Well over 100 MB just for arrays in nodes. And so on.

Of course I don't intend to store all 16 million three-character strings in my trie, so most of that space is simply wasted. The arrays are overwhelmingly null references, since their capacity far exceeds the number of keys actually inserted. And Unicode makes it worse: a Java char has 65,536 possible values, not 256, so the per-node arrays would grow 256-fold.
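One common first fix for both problems is a map-based child table, so each node only pays for the children it actually has and any char value works as a key. A minimal sketch (names are illustrative, not from any particular library):

```java
import java.util.HashMap;
import java.util.Map;

// Trie variant where children live in a small HashMap instead of a
// fixed 256-slot array: memory scales with inserted keys, and any
// char (Cyrillic included) is a valid edge label.
class MapTrie {
    private static class Node {
        Map<Character, Node> next = new HashMap<>(2); // small initial capacity
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            x = x.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        x.isWord = true;
    }

    boolean contains(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            x = x.next.get(key.charAt(i));
            if (x == null) return false;
        }
        return x.isWord;
    }
}
```

The trade-off is per-entry HashMap overhead and slower child lookup than a direct array index, so it is not free either.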

Is there any hope of making an efficient trie for strings? I have considered a couple of improvements over these types of implementations:

  • Instead of an array of references, each node could hold an array of a primitive integer type that indexes into a shared array of node references, sized close to the actual number of nodes.
  • I could break strings into 4-bit parts, which would allow node arrays of size 16 at the cost of a deeper tree.
