Using Radix Tree for Auto-Completion in Node.js

TLDR

Radix-Trie is one of the coolest implementations behind auto-complete (which is an example of search-by-prefix). You’ll want to consider using one for your by-prefix searches - client side or server side. Alas - there are tons of mediocre implementations out there. Here’s how I chose one for my customer.

The pseudo cool intro

Isn’t autocompletion awesome? You get a search field, you give it a few keystrokes and - bam! - autosuggestions for whatever you’re looking for. After all - that’s what Google Search has taught us since birth.

Yes, I know that when I was born there was no internet and computers required their own buildings, but can you please give me a break and let me consider myself young for one second? Thanks… ..but actually - that’s part of the gap. Old guys design for old patterns. Back in the day you had employees, departments and SKUs (good ol’ Northwind) - so you had a find-employee function, and a find-department function, and a find-SKU function, where today there’s just find. After all - how many personal names sound like SKUs? It will figure it out in fewer keystrokes & mouse clicks than you would need just to enter the find-person menu (which often was in the spirit of data -> entities -> employee -> find employee).

Unfortunately, I can’t say these days are over, because I keep seeing these old patterns in new designs. But at least we’re headed in the right direction - I mean as a species…

Problems with the naive implementation

Ah - I get it, so I stick all the searched titles of all my entities into a single index, and rock, right? Well, generally yeah… kind of… but in detail - it requires just a little more thought.

const ix = {};

module.exports = {
  set: (key, entity = true) => ix[key] = entity,
  get: key => ix[key],
  keys,
  values: prefix => keys(prefix).map( key => ix[key] ),
  pairs: prefix => keys(prefix).map( key => ({ key, value: ix[key] }) )
};

function keys(prefix) {
  return Object.keys(ix).filter( key => key.startsWith(prefix) );
}
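For what it’s worth, here’s the naive index in action - a sketch with the module inlined so it’s self-contained, and a few made-up entries:

```javascript
// The naive index, inlined so the example is self-contained
const ix = {};
const index = {
  set: (key, entity = true) => ix[key] = entity,
  get: key => ix[key],
  keys: prefix => Object.keys(ix).filter(key => key.startsWith(prefix)),
  values: prefix => index.keys(prefix).map(key => ix[key]),
  pairs: prefix => index.keys(prefix).map(key => ({ key, value: ix[key] }))
};

// made-up entries, just for illustration
index.set('edward allen', { id: 342134 });
index.set('edward barton', { id: 342135 });
index.set('edna cohen', { id: 342136 });

console.log(index.keys('ed'));                       // all three keys
console.log(index.values('edward').map(e => e.id));  // [342134, 342135]
```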

Let’s please keep things simple at this stage. I don’t want to get into discussing iterable generators or ES6 Maps and Sets. There are far more important things before that. So:

First - this index is intolerant of typos. While you may consider .toLowerCase()-ing your keys to make it case insensitive (and if you do - please do it once on set, and once per search, not once per search-loop iteration) - you will still need to work harder to guess typing errors.
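A sketch of what “once on set, once per search” means in practice (normalize here is my own hypothetical helper, not part of the index above):

```javascript
const ix = {};

// hypothetical normalizer - applied once on the way in, once on the way out
const normalize = key => key.trim().toLowerCase();

const set = (key, entity = true) => ix[normalize(key)] = entity;

const keys = prefix => {
  const p = normalize(prefix); // once per search...
  return Object.keys(ix).filter(key => key.startsWith(p)); // ...not per iteration
};

set('Edward Allen', 1);
set('EDNA COHEN', 2);

console.log(keys('eD')); // matches both, whatever the casing
```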

Second - you’ll have performance issues that grow with your data-set size - every search scans all the keys.

But if your data-set is small enough - then you don’t have a problem. Actually, if you have a small data-set - you might consider not doing it on the server at all - just give the entire data-set to the client and let it handle this index in the browser - there ain’t no smoother experience than that.

The consideration is to keep download times short - my rule of thumb is around 100M, which is enough space for a lot of data, but what counts is the satisfaction of the UX guys who know what is reasonable to ask a client to wait (that is, wait after the bloody framework is done loading… even if you load it lazily in the background :P).

Personally - whenever the index-set is in the magnitude of n * 10^4 I usually recommend sending the data-set to the client, and I insist on it the more static the data is. A magnitude above n * 10^3 is where I’ll consider trading the lightweight 10-code-lines naive implementation for a more performant library - even on the client.

Anyway, more often than not - I just send the whole dataset as is:

[{
   id: 342134,
   fname: "Edward",
   lname: "Allen",
   yearOfBirth: 1873,
   //... whatever other attributes
},
 //...whatever more entries
]

Sometimes - I take a few extra steps to optimize for the network and send it like so:

[
   [342134, "Edward", "Allen", 1873 ... ],
   //...whatever more entries
]

…with a constructor/factory on the client that maps the fields back to actual property names, so as a data structure on the client you get objects with actual properties - just like in the previous example.
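Such a factory can be as small as this (the field list is an assumption for illustration - use whatever order your wire format actually has):

```javascript
// assumed wire order: [id, fname, lname, yearOfBirth]
const FIELDS = ['id', 'fname', 'lname', 'yearOfBirth'];

// map a compact row back to an object with real property names
const toEntity = row =>
  FIELDS.reduce((entity, field, i) => (entity[field] = row[i], entity), {});

const wire = [
  [342134, 'Edward', 'Allen', 1873]
];

const entities = wire.map(toEntity);
console.log(entities[0].lname); // "Allen"
```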

One more tip worth considering here - is to use a good ol’ JSONP call and format the output as an Object Literal passed to a callback, rather than returning a valid JSON string for JSON.parse(response.text). This will save you quite a few unnecessary bytes of null array elements for null or missing attributes - which can be elided when writing Object Literals (i.e. JSON: [5, "Albert", null, null, null, 84] vs Object Literal: [5, "Albert",,,,84]).
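To see the difference, note that elided holes are legal JavaScript but not legal JSON - JSON.stringify has to put the nulls back:

```javascript
// Valid as a JS object literal (so fine in a JSONP payload), but not valid JSON:
const row = [5, 'Albert', , , , 84]; // three elided holes

console.log(row.length);          // 6
console.log(JSON.stringify(row)); // '[5,"Albert",null,null,null,84]'
```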

But sometimes you can’t do it client-side no matter how compact your protocol is, because your data-set is

  • either too big (e.g. - list of all cities, towns, states, airports, and train stations in …wherever)
  • or constantly changing (e.g. a list of auction bids and bidders’ balances).
  • or because it is too sensitive to pack and deliver as one source.

Any of the constraints above will force you to use a server-side implementation.

If it’s just size - you might have a last shot at providing the client with just a search index that ends with a key by which the full entity may be retrieved via a separate API. But if the whole thing is just too damn big or too dynamic, with entries constantly opting in and out - you just have to do it on the server.

If you decided to do it on the server and you’re considering the naive implementation - you may also consider a change of career. Starving the event loop of the server is a crime of a different severity than starving a browser’s event loop.

Enter Radix-Trie

We’re not going to implement a radix-trie here. I mean, it’s a cool exercise and all - but there are tons of Radix-Trie implementations on npm. But you do need to understand the idea.

The first motive is operations-efficiency. We don’t want to iterate all the keys of the index on every search. So we need some sort of sorted data structure that can be searched easily. We basically want to know, on each keystroke, what possibilities it opens, and offer them to the user. This is basically a tree, where every user input navigates us down the tree to more specific results.

The second motive is memory-efficiency. We would like the data structure to be as compact as possible - i.e. take as little memory as possible - we don’t want nodes in the tree that don’t translate to actual results. If a node leads to exactly one child node which may lead to further results - they can be represented as a single node in the tree.
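To make the compaction concrete, here are the same three keys ("team", "test", "tester") hand-written as a plain character trie versus a radix trie - nested object literals just for illustration, with $ marking end-of-word:

```javascript
// plain trie: one node per character
const trie = {
  t: {
    e: {
      a: { m: { $: true } },
      s: { t: { $: true, e: { r: { $: true } } } }
    }
  }
};

// radix trie: single-child chains collapse into one edge
const radix = {
  te: {
    am: { $: true },
    st: { $: true, er: { $: true } }
  }
};

console.log(Object.keys(radix));    // ['te']
console.log(Object.keys(radix.te)); // ['am', 'st']
```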

Here’s the most common visual example: /media/radix.png

And some theory text about it.

Enter Radix-Trie on npm

The part that convinced me that an article is needed is that there are tons of implementations, and most of them actually suck - especially the popular ones, while the good ones are not intuitively searchable.

So, now that we know what we’re after - the next step is to search for implementations on npm, where you get results with 3 measures per result: popularity, quality and maintenance.

First thing, you see the idea is not popular. No package reports popularity above 5%. I would expect higher popularity. Well, we’re here to fix that. Second thing - you see a lot of results. While I read and evaluated all of them right before concluding this stuff deserves an article, here I’ll redo just a few with you.

So first - I went over the readme of each of them, and filtered out all implementations that do not provide a search-by-prefix. Some implement just get/set by key and let you iterate over all of them (gosh, why would you want to do that - a simple Map will do for that). Some implement get and set, and expose the internals so you may traverse them yourself. But this will not do. I’d rather have a battle-tested implementation, please.

What’s left is the candidates for evaluation, from which we’ll choose just one.

Evaluating candidates

I like to evaluate libraries in the node.js shell. That includes server frameworks, db-connectors, http clients, rest clients, queue consumers - you name it. If I can do it in a shell - it says a lot about the library. It means it’s simple to use, intuitive, open/transparent, decoupled from configuration implementations, and more. The shell lets me look inside at runtime and tickle the modules from within, as much as they are open to it. And I love that the node shell records the history for me.

I hate writing benchmark repositories, which in my eyes are more often than not there for bragging. But sometimes I do fall back to actually writing some files - and whenever I have to, it more often than not implies bad things about the complexity and/or stability of the library rather than the complexity of the use-case. And sometimes I have to brag too.

Today I’ll do it without a repository. So let’s install our candidates.

$ npm install --no-save radix-trie-js radix-trie trie-search raddog

Note: In my current project I have the uuid package, which I’ll be using to generate keys. If you don’t have it - you should install it too.

When testing data structures, first test memory efficiency - which is how much data the structure can hold before it explodes. Literally explodes. I’ll write a loop that feeds the data structure until the node.js process crashes out-of-memory. I’ll make sure to output the size of the data structure every now and then, so we know when it explodes. Since the machine is the same, the node runtime is the same, and the memory limit is the same - the effective variable is the implementation.

Let’s take radix-trie-js for example:

$ node
> id = require('uuid').v4; RxT = require('radix-trie-js'); t = new RxT();1
1
> i = 0; while(true) ++i % 10000 == 0 && console.log('key: ', i) || t.add(id(), 1);
