Choosing a package from npm

Having a ready-made implementation for anything non-trivial is very useful, and lucky for us - the JavaScript and Node.js community - we have npm. Unfortunately, searching on npm often returns many packages that are mediocre at best. After having to choose a radix-trie implementation for a customer, I decided to share how I choose packages from npm, using this last search as a showcase.

The showcase - Radix-Trie algorithm

Radix-Trie is a tree index that helps find search-terms by prefix. The implementations I'm interested in run in-memory, and traverse the tree quickly to update it or return search results.

The need for a tree implies a big dataset, so the first motive is operational efficiency. We don't want to iterate over all the keys of the index for every search, so we need some sort of sorted data-structure that can be searched easily.

Basically, we want to know, on each user key-stroke, what possibilities it opens, and offer them for selection. This is essentially a tree, where every additional user input navigates us down the tree to more specific results.

The second motive is memory efficiency. We would like the data structure to be as compact as possible - i.e. take as little memory as possible - we don't want nodes in the tree that don't translate to actual results. If a node leads to exactly one node which may lead to further results - the two can be represented as a single node on the tree.

Here’s the most common visual example: /media/radix.png

And some theory text about it.
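
To make the idea concrete, here is a minimal, illustrative sketch - not one of the evaluated libraries - of a plain trie with prefix search. A radix-trie goes one step further and merges every single-child chain into one node:

// Illustrative only: a naive (uncompressed) trie. A radix-trie would merge
// single-child chains, e.g. 'c' -> 'a' -> 'r' becomes one 'car' node.
class Trie {
  constructor() { this.root = { children: new Map(), value: undefined }; }

  set(key, value) {
    let node = this.root;
    for (const ch of key) {
      if (!node.children.has(ch)) node.children.set(ch, { children: new Map(), value: undefined });
      node = node.children.get(ch);
    }
    node.value = value;
  }

  // collect all keys under a prefix - each typed character is one step down the tree
  search(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    const results = [];
    const walk = (n, key) => {
      if (n.value !== undefined) results.push(key);
      for (const [ch, child] of n.children) walk(child, key + ch);
    };
    walk(node, prefix);
    return results;
  }
}

t = new Trie();
['car', 'cart', 'care', 'dog'].forEach(k => t.set(k, 1));
console.log(t.search('car')); // [ 'car', 'cart', 'care' ]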

Choosing Candidates

So, now that we know what we're after, the next step is to search for implementations on npm. Searching on npm gives three measures per result: popularity, quality and maintenance.

The first thing you see is that the idea is not popular: no package reports popularity above 5%. I would expect higher popularity - well, we're here to fix that. The second thing you see is a lot of results. While I read and evaluated many of them right before deciding this deserves an article, here I'll redo the full evaluation with you on just a few.

At the same glance you get links to their open-source repos (usually git), the number of open issues, pull requests, and other obvious things. The goal here is to reassure yourself that the project is alive - or stable, preferably both. While we're here, it's worth having a glance at the license - just to be on the safe side legally.

Next, I went over the readme of all of them and filtered out the implementations that do not provide search-by-prefix. Some implement just get/set by key and let you iterate over all entries (gosh, why would you want to do that - a simple Map will do for that). Some implement get and set, and expose the internals so you may traverse them yourself. But this will not do. I'd rather have a battle-tested implementation, please.

Some may just not document their API well. If I had only a few results to choose from, I'd start going over the code, beginning with the tests. Often the tests are written as specifications, and let you understand what you can do with the library and how to use it. In rare cases I have to actually read the code already at this scanning stage to get the info I need.

What's left past this preliminary scan are the candidates for evaluation, from which we'll choose just one.

Evaluating Candidates

I like to evaluate libraries in the node.js shell. That includes server frameworks, db-connectors, http clients, rest clients, queue consumers - you name it.

If you have never done any of the following - you don’t know what power you’re giving up:

> (c = require('redis').createClient({host: process.env.REDIS})).keys('sess:*', console.log);1
1
['sess:sdlc.3243', 'sess:sdlc.3244', 'sess:sdlc.3250']
> svr = require('http').createServer((q,a) => a.end('OK :)')); svr.listen(3300);1
1
> require('axios').get('http://localhost:3300').then(({status, data}) => console.log(status, data), console.log);1
1
200 OK :)
> app = require('express')().use(require('body-parser').json()).post('/emp/:empId', ({params, body},a) => a.json({params,body})); app.svr = app.listen(3400);1
1
>  require('axios').post('http://localhost:3400/emp/some-id', {bo: 'dy'}).then(({status, data}) => console.log(status, data), console.log);1
1
200 { params: { empId: 'some-id' }, body: { bo: 'dy' } }
> app.svr.close(); svr.close(); c.end();
undefined
>

What? Create a new file for this one-liner experiment? Don't laugh - sadly, there are too many who would open a whole new project for that.

If I can do it in a shell - it says a lot about the library. It means it's simple to use, intuitive, open and transparent, decoupled from configuration concerns, and more. The shell lets me look inside at runtime and tickle the modules from within, as much as they allow. And I love that the node shell records the history for me. For me, the shell is like a sandbox debugger on steroids.

BTW - before node.js I tried things in the browser’s developer tools, right from the console. I would find a CDN distribution and do something in the spirit of:

e = document.createElement('script'); e.src = 'the url to the library'; document.body.appendChild(e);

and wait for the module to appear on window. This is a technique one can still use. (Before developer tools, we used to hack the address-bar with javascript:alert(document.getElementById('emp-3315')) - the last time I had to do that was not too long ago, debugging an app on a user's mobile device which disables developer mode.)
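
A slightly fuller, hedged version of that trick - the CDN URL and the global name below are placeholders for whatever library you're poking at:

e = document.createElement('script');
e.src = 'https://cdn.example.com/some-lib.min.js';       // placeholder URL - use the library's real CDN distribution
e.onload = () => console.log('loaded:', window.someLib); // placeholder global - whatever the library exposes on window
document.body.appendChild(e);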

The point is - shell is powerful - use it like a pro.

Besides, I hate writing benchmark repositories, which in my eyes are more often than not there for bragging or marketing rather than for getting real measures. Sometimes I do fall back to actually writing some files - and whenever I have to, it usually says more about the complexity and/or stability of the library than about the complexity of the use-case. And sometimes I have to brag too :)

Here I’ll do it using a shell, without a repository. The goal is a summary like this (which is the actual summary):

Summary

The most trivial package to find was mediocre and lame. Too bad the best implementation is not the most trivial to find.

library       | API   | max entries | avg ms, load 200K | avg ms, search 200K
trie-search   | sleek | 256800 (!)  | 10360.666         | 3202.75
radix-trie    | meh.  | 860000      | 11561.333         | 204
radix-trie-js | ok…   | 2670000     | 30812.5           | 814
raddog        | ok…   | 1650000     | 8178.833          | 752.75

(!) - This one literally exploded. The rest, past the noted quota, required an increasing amount of time to add each further 10K keys, but did not crash. I stopped the test at that point because with such performance they were no longer useful.

My obvious winner is raddog: fast load, fast search, reasonable memory footprint, and an API that looks fine. The sole miss is that its cursor does not fit ES6 for...of iteration. However, it predates that syntax - which also means it's battle-tested.
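
If that bothers you, a cursor of this shape (assuming it exposes next() and an end flag, as the from helper in the details section below does) can be adapted to ES6 iteration in a few lines - my own sketch, not part of the library:

// Sketch: wrap a { next(), end } style cursor as an ES6 iterable.
iterate = cursor => ({
  [Symbol.iterator]: () => ({
    next: () => cursor.end ? { done: true } : { done: false, value: cursor.next() }
  })
});

// then, hypothetically: for (const hit of iterate(t.search('ab'))) console.log(hit)
// or:                    Array.from(iterate(t.search('ab')))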

Method overview

So let’s begin by installing our candidates.

$ mkdir sandbox && cd sandbox && npm init -y
...

$ npm install radix-trie-js radix-trie trie-search raddog
...

+ trie-search@1.2.8
+ radix-trie-js@1.0.5
+ raddog@1.0.5
+ radix-trie@1.0.8
added 6 packages from 4 contributors and audited 23 packages in 4.646s
found 0 vulnerabilities

$

No vulnerabilities for any of them - that's good news. How nice of npm to check for us!

I also installed the package uuid, which I'll use to generate keys, and the package debug, which logs with the time elapsed since the previous log entry built in:

$ npm install uuid debug
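
If you haven't met debug before: every log line is printed with the time elapsed since the previous line of the same namespace ('t' below is just an arbitrary namespace name), which is exactly the progress indication we want. An illustrative taste:

process.env.DEBUG = 't';     // must be set before require('debug')
d = require('debug')('t');
d('starting');               // prints something like: t starting +0ms
d('after some work');        // prints something like: t after some work +12ms  (time since the previous line)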

That's the closest we get to a repository. All the rest is running one-liner commands in the node shell. So let's hop in:

$ node
> 

Memory footprint

When testing data-structures, I always test memory efficiency - that is, how much data the structure can hold before it explodes. Literally explodes. I write a loop that feeds the data structure more data until the node.js shell crashes out-of-memory, and I make sure to output the size of the data structure every now and then so we know when it explodes. Since the machine is the same, the node runtime is the same and the memory limit is the same - the effective variable is the implementation.

For example, benching radix-trie-js looks like:

$ node
> process.env.DEBUG='t';d=require('debug')('t');id = require('uuid').v4; RxT = require('radix-trie-js'); t = new RxT();1
1
> i = 0; while(true) ++i % 10000 == 0 && d('key: ', i) || t.add(id(), 1);
...

Notes:

  • Q: while (true)? A: Don't panic. It's just two lines in a shell - you can start a new one after it crashes. Besides, the test is exactly to see if/when it does crash.
  • Note: Writing it in one line makes the shell history very useful.
  • Note: It's not always your fault: struggling to put simple concepts in one line also testifies to the library's complexity. Plus, you get better with time and can handle less sleek libraries in one line.
  • Q: What's the thing with all the && and || ? A: I hate blocks. You can use blocks if you like - a block version of the same loop is shown right after these notes.
  • Q: What's the thing with ;1 at the end? A: I do that when I don't want the shell to echo the returned value of the last statement. Shell users - how many times have you seen this output:
    	Promise {
        <pending>,
        domain:
          Domain {
            domain: null,
            _events: { error: [Function: debugDomainError] },
            _eventsCount: 1,
            _maxListeners: undefined,
            members: [] } }
    >
    

    not that it means anything, and it takes up space in the shell window's buffer - and that's the least bad case. Ever seen the inspection output of an HTTP request? So ;1 replaces all of it with just... 1 :)
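
For those who do prefer blocks, here is the same memory-explosion loop written the conventional way (it assumes the setup line above has already defined d, id and t; the behaviour is identical to the one-liner):

// add keys forever, log progress every 10K - same as the &&/|| one-liner
let i = 0;
while (true) {
  i++;
  if (i % 10000 === 0) d('key: ', i);
  t.add(id(), 1);
}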

Load efficiency

Here I want to load a considerable amount of keys, and measure how long it takes. Then I repeat it a few times to get a representative average.

I generate the keys using the v4 flavor of the uuid package.
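
For reference, a v4 uuid is 32 random hex digits, so every generated key starts with one of the 16 characters 0-9/a-f - which will come in handy in the lookup test below. Something like (the value shown is only an illustration):

id = require('uuid').v4;
id();   // e.g. 'a3bb189e-8bf9-4c6b-9e2a-6f1d4a9b4c3e' - illustrative value, yours will differ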

Here's an example of one of the radix-trie-js runs:

setup line:

> process.env.DEBUG='t';d=require('debug')('t');id=require('uuid').v4;RadixTrieJs=require('radix-trie-js');t = new RadixTrieJs(''); 1
  1

then the test:

> s = Date.now(); i = 0; while (++i <= 200000) i % 10000 == 0 && d('key: ', i) || t.add(id(), 1); Date.now() - s
t key:  10000 +0ms
t key:  20000 +1s
t key:  30000 +1s
t key:  40000 +1s
t key:  50000 +1s
t key:  60000 +1s
t key:  70000 +1s
t key:  80000 +1s
t key:  90000 +1s
t key:  100000 +2s
t key:  110000 +2s
t key:  120000 +1s
t key:  130000 +1s
t key:  140000 +1s
t key:  150000 +2s
t key:  160000 +2s
t key:  170000 +1s
t key:  180000 +1s
t key:  190000 +1s
t key:  200000 +1s
27623
>

Lookup efficiency

For the lookup I had a few considerations. First, I wanted to eliminate results biased by an unbalanced tree. I mean, suppose I search for a as a prefix, and by some peculiar chance most keys happen to start with one of the other 15 characters - I'd get skewed results.

Now, since we use uuids as keys, every key starts with one of the 16 characters UUID digits are made of: 0-9 and a-f.

So I decided to search once per such hex digit.

The setup is a leftover from the previous test - just load a trie with 200K keys :)

Then the test. After typing it for the first time, I just press up and enter a few times…

> s = Date.now(); '0123456789abcdef'.split('').forEach( c => Array.from(t.fuzzyGet(c)) ); Date.now() - s
1183
> s = Date.now(); '0123456789abcdef'.split('').forEach( c => Array.from(t.fuzzyGet(c)) ); Date.now() - s
627
> s = Date.now(); '0123456789abcdef'.split('').forEach( c => Array.from(t.fuzzyGet(c)) ); Date.now() - s
632

Leaks

Leaks and memory issues are a rather annoying thing to test. First, running on a VM, there are many mechanisms between you and the memory. Second, it takes time for garbage-collection to respond.

The gist of the secret sauce is:

  • work with huge data-sets
  • be patient for the system to respond, and hope to see differences
  • repeat the experiment many times.
  • repeat the experiment many times.
  • repeat the experiment MANY MANY MANY times.

Actually, this measure justifies its own blog-post, and I hope to get to it some day.

I'll excuse myself from going over all the details in this post, and just say that I did not detect memory leaks in any of the implementations - including those that declare an inner cache.

But to not leave you empty handed - here’s the overview.

I tried to measure leaks in two ways.

First - load a trie with a considerable amount of data - say, 200K keys, and then discard the reference to the root node.

The second is measured by loading a trie with a considerable amount of data, and deleting the nodes one by one until no nodes are left on the trie.

In both cases I collect memory consumption information from a couple of sources:

  • nodejs internal API: process.memoryUsage()
  • os report (ps in mac/linux or process monitor on windows).

Run many tests, collect the data - and there you go.
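
As a rough sketch of what one iteration of the first approach looks like, using radix-trie-js as the example (the shell must be started with --expose-gc for global.gc to exist):

// Rough sketch: load, drop the only reference, force a GC, compare heap usage.
// Start the shell with: node --expose-gc
id = require('uuid').v4;
RadixTrieJs = require('radix-trie-js');

before = process.memoryUsage().heapUsed;

t = new RadixTrieJs('');
i = 0; while (++i <= 200000) t.add(id(), 1);
loaded = process.memoryUsage().heapUsed;

t = null;       // discard the reference to the root node
global.gc();    // only available with --expose-gc
after = process.memoryUsage().heapUsed;

console.log({ before, loaded, after });  // expect `after` to fall back near `before` if nothing leaks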

Details

package: trie-search

  • a promising premise, but very disappointing.
  • no vulnerabilities
  • API:
    • creation: ctor(keyFields, options)
    • add: t.add(item)
    • search: t.get(prefix)
    • sleek. If only all APIs were written like that. Too bad the implementation underneath just sucks.
  • measures:
    • max entries: 256800 (ran just once)
    • load 200K entries (ms): 8469, 11591, 11070, 10549, 9825, 10660
    • 16 searches on 200K (ms): 3365, 2819, 2496, 4131 (stopped here… it's way too much and nothing is going to save it)
  • codes:
    • setup line:
          process.env.DEBUG='t';d=require('debug')('t');id=require('uuid').v4;TrieSearch=require('trie-search');t = new TrieSearch(['k']);1
      
    • load to explosion:
      i=0; while(true) ++i % 10000 == 0 && d('keys: ', i) || t.add({k: id()})
      
    • load 200K keys:
      s = Date.now(); i = 0; while(++i <= 200000) i % 10000 == 0 && d('key: ', i) || t.add({k: id()}); Date.now() - s
      
    • search:
        s = Date.now(); '0123456789abcdef'.split('').forEach( c => t.get(c) ); Date.now() - s
      

package: radix-trie

  • the best name, taken for a mediocre implementation
  • no vulnerabilities
  • api:
    • creation: ctor(locale)
    • add: t.addWord(k, v)
    • remove: t.removeWord(k)
    • search: t.autocomplete(substr)
    • I hate when they do that - adding unnecessary wording to API names. If it had anything else to add/remove, I might forgive it. Why do the poorer designs get the most searchable names?
  • measures
    • max entries: 860000 (ran just once)
    • load 200K entries (ms): 10871, 11006, 11198, 11467, 13220, 11606
    • 16 searches on 200K (ms): 203, 179, 260, 184, 194
  • test codes:
    • setup line:
        process.env.DEBUG='t';d=require('debug')('t'); id=require('uuid').v4;RadixTrie=require('radix-trie');t=new RadixTrie(''); RadixTrie + ""
      
    • load to explosion:
        i=0; while(true) ++i % 10000 == 0 && d('keys: ', i) || t.addWord(id(), 1)   
      
    • load 200K keys:
        s = Date.now(); i = 0; while(++i <= 200000) i % 10000 == 0 && d('key: ', i) || t.addWord(id(), 1); Date.now() - s
      
    • search:
        s = Date.now(); '0123456789abcdef'.split('').forEach( c => t.autocomplete(c) ); Date.now() - s
      

package: radix-trie-js

  • clearly they saw radix-trie and thought - sure I can do better. And so they did.
  • no vulnerabilities
  • api:
    • creation: ctor(locale)
    • add: t.add(k, v)
    • remove: t.delete(k)
    • search: t.fuzzyGet(substr) - search is case insensitive
    • fuzzyGet? really?
  • measures
    • max entries: 2670000, after which it got too slow to be considered functional.
    • load 200K keys (ms): 27623, 33384, 29769, 31741, 31907, 30451
    • 16 searches on 200K keys (ms): 1183, 627, 632, 653, 629, 648
  • test codes:
    • setup line:
       process.env.DEBUG='t';d=require('debug')('t');id=require('uuid').v4;RadixTrieJs=require('radix-trie-js');t = new RadixTrieJs(''); RadixTrieJs + ""
      
    • load to explosion:
       i = 0; while(true) ++i % 10000 == 0 && d('key: ', i) || t.add(id(), {});
      
    • load 200K keys:
       s = Date.now(); i = 0; while(++i <= 200000) i % 10000 == 0 && d('key: ', i) || t.add(id(), 1); Date.now() - s
      
    • search:
       s = Date.now(); '0123456789abcdef'.split('').forEach( c => Array.from(t.fuzzyGet(c)) ); Date.now() - s
      

package: raddog

  • top dog.
  • api
    • creation: ctor(uiData, title)
    • add: t.insert(item)
    • remove: t.delete(item)
    • search: t.search(prefix, filter) => cursor - the cursor is not compatible with ES6 for...of syntax
    • Well, their first version predated ES6, and I actually like what they have better than an ES6 iterable. Pity it does not fit Array.from… I had to make my own (it's in the setup line).
  • measures:
    • max entries: 1650000, after which it got too slow to be considered functional.
    • load 200K keys (ms): 8271, 8670, 8986, 7107, 8381, 7658
    • 16 searches on 200K (ms): 938, 690, 780, 603, 615, 678
  • test codes:
    • setup line:
      process.env.DEBUG='t';d=require('debug')('t');id=require('uuid').v4;Raddog=require('raddog');t = new Raddog('k','k'); from = c => {const a=[]; while(!c.end) a.push(c.next()); return a;};1
      
      
    • load to explosion:
         i = 0; while(++i) i % 10000 == 0 && d('key: ', i) || t.insert({k: id()});
      
    • load 200K keys:
         s = Date.now(); i = 0; while(++i <= 200000) i % 10000 == 0 && d('key: ', i) || t.insert({k: id()}); Date.now() - s
      
    • search:
         s = Date.now(); '0123456789abcdef'.split('').forEach( c => from(t.search(c)) ); Date.now() - s
      