Efficient trie implementation for Unicode strings
- by U Mad
I have been looking for an efficient String trie implementation. Mostly I have found code like this:
Reference implementation in Java (per Wikipedia)
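For concreteness, here is a minimal sketch of the kind of array-per-node implementation I mean. This is my own reconstruction of the usual textbook shape, not code from any particular library:

```java
// Sketch of the typical array-based trie: every node eagerly allocates
// an array of 256 child references, one slot per extended-ASCII value.
class NaiveTrie {
    private static final int R = 256;      // extended-ASCII alphabet size

    private static class Node {
        Node[] next = new Node[R];         // 256 references, mostly null
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i);         // throws for chars >= 256!
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isWord = true;
    }

    boolean contains(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i);
            if (c >= R || x.next[c] == null) return false;
            x = x.next[c];
        }
        return x.isWord;
    }
}
```

Note that insert simply indexes the child array with the character value, so anything outside the 0..255 range blows up.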
I dislike these implementations for two main reasons:
They support only 256 (extended-ASCII) characters, but I need to cover things like Cyrillic.
They are extremely memory inefficient.
Each node contains an array of 256 references, which is 2048 bytes on a
64-bit JVM (8 bytes per reference). Each of those nodes can in turn have up to 256
subnodes, each with its own 2048-byte array. So a full trie for
every two-character extended-ASCII string would require about 257 such arrays, roughly half a megabyte. Three-character strings? Around 65,800 arrays, about 128 MB, just for the arrays in nodes. And so on.
Of course I don't intend to store all 16+ million three-character strings in my trie, so most of that space is simply wasted. The arrays are almost entirely null references, since their capacity far exceeds the actual number of inserted keys. And if I add Unicode, the arrays get even larger (a Java char has 65,536 possible values instead of 256).
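To make the arithmetic above concrete, here is a quick back-of-envelope calculation under the stated assumptions: 8-byte references on a 64-bit JVM without compressed oops, object and array headers ignored, and only nodes that actually hold a child array counted:

```java
// Back-of-envelope memory estimate for the naive 256-slot-array trie.
class TrieMemoryEstimate {
    // bytes spent on child-reference arrays for the given node count
    static long arrayBytes(long nodesWithChildren) {
        long refBytes = 8;                 // one reference, 64-bit JVM
        return nodesWithChildren * 256 * refBytes;
    }

    public static void main(String[] args) {
        long depth2 = 1 + 256;             // all 2-char keys: root + level 1
        long depth3 = 1 + 256 + 256L * 256;
        System.out.println(arrayBytes(1));       // 2048 bytes per node
        System.out.println(arrayBytes(depth2));  // 526336, ~0.5 MB
        System.out.println(arrayBytes(depth3));  // 134744064, ~128 MB
    }
}
```

With compressed oops (the default on heaps under 32 GB), references drop to 4 bytes and every figure halves; the waste is still enormous.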
Is there any hope of making an efficient trie for strings?
I have considered a couple of improvements over these types of implementations:
Instead of an array of references, each node could hold an array of a primitive integer type that indexes into a compact array of node references, sized close to the number of actual nodes.
I could break each character into 4-bit parts, which would allow child arrays of size 16 at the cost of a deeper tree.
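The first idea, paying only for children that actually exist, could be sketched like this. It is my own variant (sorted parallel arrays grown on demand, rather than an explicit integer index array), but it chases the same goal and handles the full char range, Cyrillic included:

```java
import java.util.Arrays;

// Sketch: each node stores a sorted char[] of edge labels plus a parallel
// Node[] of the same length, so memory tracks the real number of children.
class CompactTrie {
    private static class Node {
        char[] labels = new char[0];   // sorted edge characters
        Node[] kids = new Node[0];     // parallel array, same length
        boolean isWord;

        Node child(char c) {
            int i = Arrays.binarySearch(labels, c);
            return i >= 0 ? kids[i] : null;
        }

        Node addChild(char c) {
            int i = Arrays.binarySearch(labels, c);
            if (i >= 0) return kids[i];
            int at = -i - 1;           // insertion point keeps order sorted
            char[] nl = new char[labels.length + 1];
            Node[] nk = new Node[kids.length + 1];
            System.arraycopy(labels, 0, nl, 0, at);
            System.arraycopy(kids, 0, nk, 0, at);
            nl[at] = c;
            nk[at] = new Node();
            System.arraycopy(labels, at, nl, at + 1, labels.length - at);
            System.arraycopy(kids, at, nk, at + 1, kids.length - at);
            labels = nl;
            kids = nk;
            return nk[at];
        }
    }

    private final Node root = new Node();

    void insert(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) x = x.addChild(key.charAt(i));
        x.isWord = true;
    }

    boolean contains(String key) {
        Node x = root;
        for (int i = 0; i < key.length() && x != null; i++) x = x.child(key.charAt(i));
        return x != null && x.isWord;
    }
}
```

Lookups become O(log k) per character for k children instead of O(1), and inserts copy the arrays, which is the usual trade for the memory savings.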
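The 4-bit idea might look like the following sketch (again my own, hypothetical code): each 16-bit char is consumed as four nibbles, so every node needs only 16 child slots and any char value, Cyrillic included, fits, at the cost of a tree four times deeper:

```java
// Sketch: split each 16-bit char into four 4-bit nibbles, high bits first,
// so child arrays shrink from 256 (or 65536) slots down to 16.
class NibbleTrie {
    private static class Node {
        Node[] next = new Node[16];    // one slot per nibble value
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            for (int shift = 12; shift >= 0; shift -= 4) { // 4 nibbles/char
                int nib = (c >> shift) & 0xF;
                if (x.next[nib] == null) x.next[nib] = new Node();
                x = x.next[nib];
            }
        }
        x.isWord = true;
    }

    boolean contains(String key) {
        Node x = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            for (int shift = 12; shift >= 0; shift -= 4) {
                x = x.next[(c >> shift) & 0xF];
                if (x == null) return false;
            }
        }
        return x.isWord;
    }
}
```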