Efficient Trie implementation for unicode strings
Posted by U Mad on Programmers, published 2012-07-05T11:25:42Z
Tags: how-to
I have been looking for an efficient String trie implementation. Mostly I have found code like this:
The reference implementation in Java (per Wikipedia)
I dislike these implementations for two main reasons:
- They support only 256 (extended-ASCII) character values. I need to cover scripts like Cyrillic.
- They are extremely memory inefficient.
Each node contains an array of 256 references, which is 256 × 8 = 2048 bytes on a 64-bit JVM (without compressed oops). Each of these nodes can have up to 256 subnodes, each carrying its own 2048-byte reference array. So a full trie over every two-character string would need roughly half a megabyte just for the arrays in nodes; for three-character strings, about 128 MB. And so on.
Of course I don't intend to store all 16 million three-character strings in my trie, so most of that space is simply wasted. The arrays are overwhelmingly null references, since their capacity far exceeds the actual number of inserted keys. And if I add Unicode, the arrays get larger still (a Java char has 65,536 possible values instead of 256).
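For reference, the array-per-node layout I'm criticizing looks roughly like this (a sketch in the spirit of the Wikipedia-style code, not the exact linked implementation; class and method names are mine):

```java
// Naive array-based trie node, as in typical textbook examples.
// Each node allocates a 256-slot child array up front:
// 256 references * 8 bytes = 2048 bytes per node on a 64-bit JVM
// (without compressed oops), whether or not any children exist.
class NaiveTrieNode {
    boolean isWord;                       // true if a key ends at this node
    NaiveTrieNode[] children = new NaiveTrieNode[256];

    void insert(String key, int pos) {
        if (pos == key.length()) { isWord = true; return; }
        int c = key.charAt(pos) & 0xFF;   // silently wrong for chars >= 256!
        if (children[c] == null) children[c] = new NaiveTrieNode();
        children[c].insert(key, pos + 1);
    }

    boolean contains(String key, int pos) {
        if (pos == key.length()) return isWord;
        int c = key.charAt(pos) & 0xFF;
        return children[c] != null && children[c].contains(key, pos + 1);
    }
}
```

Note how the `& 0xFF` mask quietly collapses anything outside the 0–255 range, which is exactly why Cyrillic doesn't fit this design.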
Is there any hope of making an efficient trie for strings? I have considered a couple of improvements over these types of implementations:
- Instead of an array of references, each node could hold an array of a primitive integer type that indexes into a compact array of node references, sized close to the actual number of children.
- I could break each character into 4-bit parts, which would allow node arrays of size 16 at the cost of a deeper trie.
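The first improvement could be sketched like this (all names are illustrative; I use a `short[]` lookup table for simplicity, and this sketch still masks to 8 bits, so it would need to be combined with a wider table or the nibble split for full Unicode):

```java
// Sketch of the indexed-children idea: instead of 256 node references
// (256 * 8 = 2048 bytes on a 64-bit JVM), each node keeps a short[]
// lookup table (256 * 2 = 512 bytes) whose entries index into a
// compact child array holding only the children that actually exist.
class IndexedTrieNode {
    boolean isWord;
    private final short[] lookup = new short[256];  // slot of child c, or -1
    private IndexedTrieNode[] children = new IndexedTrieNode[0];

    IndexedTrieNode() { java.util.Arrays.fill(lookup, (short) -1); }

    private IndexedTrieNode child(int c, boolean create) {
        if (lookup[c] >= 0) return children[lookup[c]];
        if (!create) return null;
        // grow the compact child array by one slot on demand
        children = java.util.Arrays.copyOf(children, children.length + 1);
        lookup[c] = (short) (children.length - 1);
        children[lookup[c]] = new IndexedTrieNode();
        return children[lookup[c]];
    }

    void insert(String key, int pos) {
        if (pos == key.length()) { isWord = true; return; }
        child(key.charAt(pos) & 0xFF, true).insert(key, pos + 1);  // 8-bit only here
    }

    boolean contains(String key, int pos) {
        if (pos == key.length()) return isWord;
        IndexedTrieNode next = child(key.charAt(pos) & 0xFF, false);
        return next != null && next.contains(key, pos + 1);
    }
}
```

A sparse node with, say, three children then costs 512 bytes of table plus three references, instead of a full 2048-byte reference array.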
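The second improvement, splitting each `char` into 4-bit parts, might look roughly like this (a sketch; `NibbleTrie` and its methods are names I made up). Each 16-bit `char` becomes four nibbles, so every node needs at most a 16-slot child array, lazily allocated, and the full `char` range, including Cyrillic, is covered at the cost of a trie four times as deep:

```java
// Sketch of a nibble trie: each 16-bit char is consumed as four
// 4-bit nibbles, most significant first, so nodes never need more
// than 16 child slots and the full char range is supported.
class NibbleTrie {
    private static final class Node {
        boolean isWord;
        Node[] children;             // lazily allocated, 16 slots
    }

    private final Node root = new Node();

    public void insert(String key) {
        Node cur = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            for (int shift = 12; shift >= 0; shift -= 4) {
                int nibble = (c >> shift) & 0xF;
                if (cur.children == null) cur.children = new Node[16];
                if (cur.children[nibble] == null) cur.children[nibble] = new Node();
                cur = cur.children[nibble];
            }
        }
        cur.isWord = true;
    }

    public boolean contains(String key) {
        Node cur = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            for (int shift = 12; shift >= 0; shift -= 4) {
                if (cur.children == null) return false;
                cur = cur.children[(c >> shift) & 0xF];
                if (cur == null) return false;
            }
        }
        return cur.isWord;
    }
}
```

The trade-off is four node hops per character instead of one, but each node's child array shrinks from 2048 bytes to at most 128, and is only allocated when a node actually has children.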
© Programmers or respective owner