Auto-Completion Techniques in JavaScript and Node.js

Isn’t autocompletion awesome? You get a search field, you give it a few keystrokes and - bam! - autosuggestions for whatever you’re looking for. After all - that’s what Google Search has taught us since birth.

Yes, I know that when I was born there was no internet and computers required their own buildings, but can you please give me a break and let me consider myself young for one second? Thanks… But actually - that’s part of the gap…

Old guys design for old patterns. You see, back in the day you had employees, departments and SKUs (good ol’ Northwind) - so you had a find-employee function, a find-department function and a find-SKU function, where today there’s just find. After all - how many personal names sound like SKUs? A single find will figure it out in fewer keystrokes and mouse clicks than you would need just to reach the find-person menu (which often was in the spirit of Data -> Entities -> Employee -> Find Employee, and then you had to fill the form and click Search).

Unfortunately, I can’t say these days are over, because I keep seeing these old patterns in new designs. But at least we are headed in the right direction - I mean as a species… :P

Ah - I get it, so I stick all the searchable titles of all my entities into a single index, and rock, right? Well, generally yeah… kind of… but in detail - it requires just a little more thought.

Problems with the naive implementation

Consider the following implementation:

const ix = {};
module.exports = { 
  set: (key, entity = true) => ix[key] = entity,
  get: key => ix[key],
  keys,
  values: prefix => keys(prefix).map( key => ix[key] ),
  pairs: prefix => keys(prefix).map( key => ({key, value: ix[key]}) )
};

function keys(prefix) {
    return Object.keys(ix).filter( key => key.startsWith(prefix) )
}
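
Used as a module, it would look something like this (the entity values here are just made-up examples, assuming the snippet above is saved as index.js):

const ix = require('./index');

ix.set('Edward Allen', {type: 'employee', id: 342134});
ix.set('Edinburgh',    {type: 'city',     id: 17});

ix.keys('Ed');    // -> ['Edward Allen', 'Edinburgh']
ix.values('Ed');  // -> [{type: 'employee', id: 342134}, {type: 'city', id: 17}]
ix.keys('ed');    // -> [] - case sensitive, see below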

Let’s please keep things simple at this stage. I don’t want to discuss iterable generators or ES6 Maps and Sets. There are far more important things to cover before that.

So what are the real problems?

First - this index is intolerant of typos. While you may .toLowerCase() your keys to make it case-insensitive (and if you do - please do it once on set and once per search, not once per search-loop iteration) - you will still need to work harder to guess typing errors.
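
A minimal sketch of that normalization, applied to the naive index above (the normalize helper is mine, just for illustration):

const ix = {};
const normalize = key => key.trim().toLowerCase(); // applied once on set...

const set = (key, entity = true) => ix[normalize(key)] = entity;

function keys(prefix) {
    const p = normalize(prefix); // ...and once per search, not per loop iteration
    return Object.keys(ix).filter( key => key.startsWith(p) )
}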

Second - you’ll have performance issues that grow with your data-set size: every search scans every key in the index, so each additional order of magnitude hurts ten times more.

Considering a Client-Side Index

If your data-set is small enough - then you don’t have a problem - you can use the 10-code-lines naive implementation and live with it well. No additional effort required. Actually, if you have a small data-set - you might consider doing it all on the client side - just give the entire data-set to the client and let it handle the index in the browser - there ain’t no smoother experience than that, and the coding is mostly straightforward. This model is called the briefcase model, because the client “takes the data” with it wherever it goes “in a briefcase”.

In case the data is dynamic - you may either set up a socket.io connection and have all connected clients notified upon changes so they can update their local index, or expose an API they can poll for changes.
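
A rough sketch of the push option, assuming socket.io on the server (the event name and payload shape here are made up for illustration):

const http = require('http');
const server = http.createServer(); // or your existing express server
const io = require('socket.io')(server);

// call this whenever an entity changes, so every connected client
// can update its local index
const publishChange = (entity, entry) => io.emit('index-change', {entity, entry});

server.listen(8080);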

There are a few considerations when sending the entire data-set to the client.

First - keep download times short. My rule of thumb is around 10MB - which is enough space for a lot of data, but the final word should belong to the UX guys, who know how long it is reasonable to ask a user to wait (that is, wait after the bloody framework has finished loading… even if you load it lazily in the background :P).

Personally - whenever the index is in the magnitude of 10^4 entries I usually recommend sending the data-set to the client, and I insist even more the more static the data is and the more likely the user is to search it a few times during her stay in our app. BTW - a magnitude of 10^3 entries or below is where I won’t even consider trading the lightweight 10-line naive implementation for a more performant library, especially when the index is on the client.

So in case the data is not that big - I just send the whole dataset as is:

{"employees":[{
   id: 342134
   fname: "Edward",
	 lname: "Allen",
	 yearOfBirth: 1873,
	 //... whatever else attributes 
}, 
 //...whatever more entries
]
//...whatever more entities
}

Optimized traffic

Sometimes - I take a few steps to optimize for the network and send it like so:

{"employees":[ 
   [342134, "Edward", "Allen", 1873 ... ]
	 //...whatever more entries
],
//...whatever more entities
}

This goes along with a factory that maps the positional fields back to actual property names, so the data structure on the client ends up holding objects with real property names - just like in the previous example.

Something in the spirit of:

import axios from 'axios'
import model from './data/model'

const entityConverter = {
   employees: ([id, fname, lname, yearOfBirth]) => ({id, fname, lname, yearOfBirth}),
   //... whatever other converters
}

axios.get(DATA_INDEX_URL)
.then(({data}) => Object.keys(data).forEach(
   entity => model.init(entity, data[entity].map(entityConverter[entity]))
))

Mind that once the data is obtained - we pass it to model.init(...) to add each entity to its index. The same method is called when polling for changes, or when an update arrives through socket.io.
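
Keeping the client index fresh could look something like this (the 'index-change' event name matches the server-side sketch above and is my assumption, not part of socket.io):

import io from 'socket.io-client';
import model from './data/model';
// entityConverter is the same converters map defined above

const socket = io(); // connects back to the host that served the page

socket.on('index-change', ({entity, entry}) =>
   model.init(entity, [entityConverter[entity](entry)]) // same init used on startup
);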

One more tip you may consider here - use a good ol’ JSONP call for the model initiation and format the output as object literals passed to a callback, rather than returning a valid JSON string for JSON.parse(response.text). This will save you quite a few unnecessary megabytes of null array elements for missing optional attributes - nulls which are unnecessary when writing object literals (i.e. JSON: [5, "Albert", null, null, null, 84] vs object literal: [5, "Albert",,,,84]). It will require you to implement the serialization yourself, but in many cases it reduced my traffic by tens of percent.
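
A sketch of what that serialization might look like on the server - assuming an express app, an in-memory data object shaped like the compact payload above, and a handleModelData callback that you define on the client before loading this script (all of these names are mine, for illustration only):

// write a row as an array literal, eliding null/undefined instead of writing `null`
const serializeRow = row =>
   '[' + row.map(v => v == null ? '' : JSON.stringify(v)).join(',') + ']';
// serializeRow([5, "Albert", null, null, null, 84]) -> '[5,"Albert",,,,84]'

app.get('/data-index.js', (req, res) => {
   const body = Object.keys(data).map(entity =>
      `${JSON.stringify(entity)}:[${data[entity].map(serializeRow).join(',')}]`
   ).join(',');
   // the client loads this via a <script> tag and handleModelData feeds model.init(...)
   res.type('application/javascript').send(`handleModelData({${body}})`);
});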

One more optimization you may consider - provide the client with just a search index that resolves to a key by which the full entity data may be retrieved later via a separate API. You download only the search index that converts search terms to ids, which means that entries sent to the index contain just the entity type, name and id.
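
In that spirit, the client code might look something like this (the URLs and the search-index module path are assumptions for illustration):

import axios from 'axios';
import searchIndex from './search-index'; // e.g. the naive index from the top of the post

// download only the term -> {type, id} mapping
axios.get('/api/search-index')
   .then(({data}) => data.forEach(({type, name, id}) => searchIndex.set(name, {type, id})));

// fetch the full entity only when the user actually picks a suggestion
const fetchEntity = ({type, id}) => axios.get(`/api/${type}/${id}`).then(({data}) => data);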

Sometimes you can’t do it client-side no matter how compact your protocol is and how optimized your network is. Sometimes your index gets just too big, or too dynamic - with entries constantly opting in and out - and then you just have to do it on the server.

Server-Side Index

You will have no choice but to do it on a server when your data set introduces one or more of the following constraints:

  • too big (e.g. a list of all cities, towns, states, airports and train stations in, like… the whole world)
  • it is your main business asset, and is served through an API that is throttled and/or billed by usage
  • too sensitive to pack and deliver as one piece (e.g. election-area residents)
  • constantly changing (e.g. a list of auction bids and bidders’ balances).

The naive server-side implementation is to keep all the results in some database, and run queries against it. Many in-house solutions for corporate users do just that.

The better ones take it a step further. Oh no, the days when users pressed a key and then looked down for the next one are over. Today users are born with a keyboard at their fingertips, keep their eyes on the screen, and expect immediate feedback to their keystrokes. Often - a DB query is just not fast enough.

Blazing-Fast Server-Side Index

The thing with servers is that they can use a lot of memory and keep the whole index in RAM, without needing to send it anywhere - they just hold on to it and provide blazing-fast results. In many cases one server is enough, but in some cases the data is too big to reside in one server’s RAM.

The first division that comes to mind is to dedicate a server just for the index. This means we’ll use a host and all its memory to hold just the index, and facilitate access to it through some sort of a microservice.

A key in the index is a search term, and the value is a simple object describing just the entity name, type and id. Once an autocompletion entry is selected by the user - the full data can be fetched from another server, one which does not have to dedicate all its memory to the index. It may even be that behind the scenes different entities are fetched from different backend servers - depending on how data is divided between your microservices. The client still sees a single base URL that points to a reverse proxy like nginx, which translates different paths to different hosts.

Consider the following nginx rules:

#in this example app1, app2 and app3 are similar, and sharing load by providing redundancy
upstream suggestions_upstream {
    server app1.suggestions.local:8080;
    server app2.suggestions.local:8080;
    server app3.suggestions.local:8080;
}

and

location /suggestions {
   rewrite ^/suggestions/(.*) /$1 break;
   proxy_pass http://suggestions_upstream;
   proxy_set_header Host $host;
   proxy_set_header X-Forwarded-Host $host;
   proxy_set_header X-Forwarded-Server $host;
   proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

When your index is too big even for one dedicated server - you’ll want to consider sharding the index itself. Sharding means that your sorted index is divided into relatively equal parts and distributed between a few index servers. You can do it with an auto-completion index because the entries in the index are quite independent of each other. Even if the same data entity can be the result of many searches - which means that searches run against different shards may point to the same result entry - it’s not so bad to duplicate the search-index result entry, which contains just name, type and id.

Consider the following nginx rules:

#in this example app1, app2 and app3 share load by sharding - they divide data-ownership between them
upstream shard1_suggestions_upstream {
    server app1.suggestions.local:8080;
}
upstream shard2_suggestions_upstream {
    server app2.suggestions.local:8080;
}
upstream shard3_suggestions_upstream {
    server app3.suggestions.local:8080;
}

and

location ~ "^/suggestions(?<command>.+)/[a-l]" {  # -- node for a-h prefixes
			# ...same as in prev. example  - except:
     proxy_pass shard1_suggestions_upstream ;
}

location ~ "^/suggestions(?<command>.+)/[m-z]" { # -- node for i-z prefixes
       #...
     proxy_pass shard2_suggestions_upstream ;
}

location ~ "^/suggestions(?<command>.+)" {    # -- the catchall node
       #...
     proxy_pass shard3_suggestions_upstream ;
}

Optimize the Index Algorithm

Whichever way you do it, client or server, sharded or not - you get a model that is initiated on application start and keeps the data indexed and accessible.

If you decided to do it on the server and you’re still considering the naive implementation - you may also want to consider a change of career. Starving a server’s event loop is a crime of a whole different severity than starving a browser’s event loop.

Enter Radix-Trie

We’re not going to implement a radix-trie here. I mean it’s a cool exercise and all - but there are tons of radix-trie implementations on npm. However - you do need to understand the idea.

The first motive is operations efficiency. We don’t want to iterate over all the keys of the index for every search, so we need some sort of sorted data structure that can be searched efficiently. We basically want to know, on each keystroke, what possibilities it opens and offer them to the user. This is basically a tree, where every user input navigates us down the tree to more specific results.

The second motive is memory efficiency. We would like the data structure to be as compact as possible - i.e. take as little memory as possible - and we don’t want nodes in the tree that don’t translate to actual results. If a node leads to exactly one child node that may lead to further results - the two can be represented as a single node in the tree.
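
As a toy illustration (not a real implementation) - indexing "team", "test" and "tester" merges the shared single-child chains into compact nodes:

// a character-per-node trie spends a node on every letter:
//   t -> e -> a -> m
//             s -> t -> e -> r
// a radix trie collapses the single-child chains:
const radixSketch = {
   te: {
      am: { word: 'team' },
      st: { word: 'test',
            er: { word: 'tester' } }
   }
};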

Here’s the most common visual example: /media/radix.png

And there’s plenty of theory written about it, if you want to dig deeper.

Finding a good implementation on npm

Obviously, a ready-made implementation of this algorithm would cut our own code down to just a few lines.

However - unfortunately, searching npm for radix-trie gave me a lot of results, many of them mediocre at best. After a thorough examination I chose raddog as my implementation of choice. Even though I had to work harder with its cursor implementation - it won the performance tests big time. Check this article for all the details of how I evaluated my candidates and why I chose it.

But once chosen - here’s the gist of the implementation:

const {assign} = Object;
const trie = require('raddog')('id', 'name'); // <-- index entries by their `name` property, identified by `id`
const MAX_SUGGESTIONS = require('config').get('search.autocomplete.maxSuggestions');

// drain the search cursor into an array, up to MAX_SUGGESTIONS entries
const from = c => {
   const a = [];
   while (!c.end && a.length < MAX_SUGGESTIONS) a.push(c.next());
   return a
}

module.exports = {
   init: (type, entry) => trie.insert(assign({type}, entry)),
   suggest: (term) => from(trie.search(term))
}

So, used in a web route in an express app:

const model = require('./model');

app.get('/suggestions/:term', ({params: {term}}, res) => {
    const suggestions = model.suggest(term);
    res.json( suggestions )
})

Or, on the client side - just use the same model method in your key-up handler (or the corresponding action) and pass the suggestions on for rendering.
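
A minimal sketch of that key-up handler (the debounce helper and renderSuggestions are placeholders for your own UI layer):

import model from './model';

const debounce = (fn, ms) => {
   let timer;
   return (...args) => { clearTimeout(timer); timer = setTimeout(() => fn(...args), ms); };
};

const input = document.querySelector('#search');
input.addEventListener('keyup', debounce(() => {
   const term = input.value.trim();
   // with the client-side index - call the model directly;
   // with the server-side index - fetch(`/suggestions/${encodeURIComponent(term)}`) instead
   renderSuggestions(term ? model.suggest(term) : []);
}, 150));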

One last note I have to make - this gist does not consider the relevance of the suggested results, it just cuts off at the first MAX_SUGGESTIONS. If you need to compute relevance to the user - you’ll have to work a little harder. There are many ways to do that, and I might get to them in another post.
