Sequencing without Locks

The usual way to sequence operations on a resource is by using locks: acquire a lock before accessing the resource and release it after. In this post I’ll write about a mechanism that avoids (explicit) locking. We used this mechanism at a previous company I worked at to gain performance and stability.


What is wrong with locking? Well, two main things:

1. The thread that attempts to acquire the lock is blocked until the lock is acquired. This means it cannot do any work, so it consumes memory (its stack) without contributing anything.


2. Locks may lead to deadlocks. If two threads acquire the same two locks in opposite orders, a deadlock is waiting to happen.


There are some alternatives to locking:

1. Using `tryLock`: if the lock cannot be acquired, the method returns at once with a false result, and it is up to the caller to decide what to do. Usually a reattempt is made, which means the thread redoes work it already did, wasting CPU as well as memory, and there is no guarantee that the next attempt will acquire the lock. This method (like acquiring with a timeout) is mainly good for avoiding deadlocks, at the price of more complex calling code (see the sketch after this list).


2. Using optimistic locking. This method works mainly with databases. The thread reads a resource (a record in the database) that carries a timestamp indicating the last time the resource was changed. When the thread updates the resource, if the timestamp hasn’t changed, the update proceeds; otherwise, it fails. As with `tryLock`, this mainly avoids deadlocks but wastes time retrying the same logic. It is also unsuitable when transactions are not possible, e.g., when modifying external resources that do not support rollback (such as rebooting a machine).


3. Using different concurrency models, such as the Actor model or Dataflow. These are great models, but the former changes the whole paradigm of development, while the latter is mainly suitable for “spreadsheet”-like scenarios.
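To make the first alternative concrete, here is a minimal Java sketch of the `tryLock`-with-retry pattern. The class, the balance example, and the bounded retry policy are illustrative assumptions, not a prescribed design:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockExample {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance = 0;

    // Attempts the update a bounded number of times instead of blocking
    // indefinitely. Returns false if the lock was never acquired; the caller
    // decides what to do next (retry later, fail the request, and so on).
    boolean tryDeposit(long amount, int attempts) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (lock.tryLock(10, TimeUnit.MILLISECONDS)) {
                try {
                    balance += amount;
                    return true;
                } finally {
                    lock.unlock();
                }
            }
            // The lock was busy: the thread will redo the attempt, burning
            // CPU, with no guarantee the next try succeeds.
        }
        return false;
    }
}
```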



At my previous company we were in a situation where we were using the standard way of accessing shared resources. The problem we faced was that we had no control over our threads. We needed to create many threads to accommodate requests made to the system, because many of them were blocked on locks and could not service those requests. We also had to fight many deadlocks, and operations often didn’t finish on time because threads were blocked, which caused timing issues.


The alternative we used was based on SEDA. In the SEDA paradigm, complex workflows are broken into stages. Each stage is a queue of tasks serviced by a (global) thread pool. As each task is processed, it can create further tasks for downstream stages. This model decouples logic from threading (a workflow is processed by several threads until completion) and also helps prioritize different stages by assigning more or fewer threads to process the queue of a given stage.
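One simple way to realize such a stage in Java (a sketch under my own naming, not the original framework):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;

// A SEDA-style stage: a queue of tasks drained by workers from a shared pool.
// A task may enqueue follow-up work into a downstream stage, so one logical
// workflow is carried to completion by whichever threads are free.
class Stage {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    // `workers` controls the stage's priority: more workers drain it faster.
    Stage(ExecutorService pool, int workers) {
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) queue.take().run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shut the worker down
                }
            });
        }
    }

    void enqueue(Runnable task) { queue.add(task); }
}
```

A task in a “parse” stage, say, would end by enqueuing into an “execute” stage, so no thread ever waits around for a whole workflow to finish.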


How did this help us achieve a better sequencing scheme? Well, for every resource we had a designated queue (obviously, you can’t do that when you have millions of resources, but we were dealing with thousands). If we needed to modify a resource, we created a task and put it in the queue. The framework guaranteed that only one thread would process tasks of that queue at any given time, and hence there couldn’t be two threads accessing the same resource (in this respect, this is a lot like the Actor model).
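The per-resource guarantee on its own can be sketched like this. It conveys the semantics only: the original framework used a small shared pool scanning the queues, whereas `newSingleThreadExecutor` here spends a whole thread per resource:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PerResourceQueues {
    private final Map<String, ExecutorService> queues = new ConcurrentHashMap<>();

    // Tasks submitted for the same resource id run strictly one at a time,
    // in FIFO order, so the task body needs no lock on that resource.
    void submit(String resourceId, Runnable task) {
        queues.computeIfAbsent(resourceId, id -> Executors.newSingleThreadExecutor())
              .submit(task);
    }
}
```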


The interesting part was dealing with tasks that needed to change several resources. You can’t always break such a task into discrete subtasks that each modify only a single resource, so we needed a better mechanism. The mechanism we used was this (a code sketch follows the list):

- Each task would be labeled with the resources it was about to modify, e.g., “res1”, “res2”. If a piece of logic first needed to find (query) which resources to modify, it would be split into two tasks (finding and modifying).

- The task would be put into all queues representing the resources it meant to modify. In the example above, it was put into the queues for “res1” and “res2”.

- The thread pool would scan all queues in a fixed order (e.g., lexicographic ordering by name or id).

- A thread would dequeue a task from a queue and increment a counter on the queue recording that it had one more running task. No more tasks would be dequeued from that queue until the counter was decremented.

- The thread would also remove the label of the matching resource from the task (we actually kept a counter and compared it to the number of labels).

- If the task had no more labels, it was able to run: it had been dequeued from all of its queues, and those queues had no other tasks being processed. The task would then run, modifying resources.

- If the task had more labels, the thread would just forget about it and move on to the next queue. More labels meant it was still waiting in other queues.

- When the task finished, the counter of every queue it had been in would be decremented.
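Here is that scheme sketched in Java. It is a minimal reconstruction from the description above, assuming a single scheduler lock around the queue state; the names and structure are mine, not the original framework’s:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

class Scheduler {
    static class Task {
        final List<String> labels;   // resources this task will modify
        int remainingLabels;         // labels not yet consumed by a dequeue
        final Runnable body;
        Task(List<String> labels, Runnable body) {
            this.labels = labels;
            this.remainingLabels = labels.size();
            this.body = body;
        }
    }

    static class ResourceQueue {
        final Deque<Task> tasks = new ArrayDeque<>();
        int running = 0;             // the counter C from the walkthrough below
    }

    // TreeMap gives the fixed (lexicographic) scan order mentioned above.
    private final TreeMap<String, ResourceQueue> queues = new TreeMap<>();

    // Push the task to all of its queues at once. This is the one lock in
    // the scheme; it guards scheduling state only, never the resources.
    synchronized void submit(Task t) {
        for (String label : t.labels)
            queues.computeIfAbsent(label, k -> new ResourceQueue()).tasks.add(t);
    }

    // One scan pass: returns a task that is ready to run, or null.
    synchronized Task poll() {
        for (ResourceQueue q : queues.values()) {
            if (q.running > 0 || q.tasks.isEmpty()) continue; // busy or empty
            Task t = q.tasks.poll();
            q.running++;                     // no more dequeues from q for now
            if (--t.remainingLabels == 0)
                return t;                    // dequeued from all queues: runnable
            // Labels remain: the task still sits in its other queues,
            // so forget it here and move on to the next queue.
        }
        return null;
    }

    // Called when a task's body finishes: release every queue it was in.
    synchronized void done(Task t) {
        for (String label : t.labels)
            queues.get(label).running--;
    }
}
```

A worker thread’s loop is then just: `poll()`, run the returned task’s body if there is one, then call `done(t)`. The worker never blocks on a resource, only briefly on the scheduler’s own monitor.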


This means that when a task ran, it required no locks, because the system guaranteed no other task would work on the same resources. It also meant that no threads were blocked. This allowed us to create only a few threads (roughly the number of CPUs) that would process thousands of tasks with high throughput.


Here’s an example. Say we have three queues, Q1, Q2, Q3, and three tasks, T1, T2, T3: T1 is labeled with Q1 and Q3, T2 with Q2 and Q3, and T3 with Q3 only. Each queue has a counter C. The scenario plays out like this:

- First pass:

  - T1 is dequeued from Q1. Q1 has C=1. T1 is still labeled with Q3, so it doesn’t run, and the thread moves to the next queue.

  - T2 is dequeued from Q2. Q2 has C=1. T2 is still labeled with Q3, so it doesn’t run.

  - T3 is dequeued from Q3. Q3 has C=1. T3 has no labels left, so the thread executes it.

All queues now have a count of 1, so nothing else runs (of course, in a real system we’d expect tasks to touch more disjoint sets of resources).

- Second pass:

  - T3 finishes executing, so Q3 has C=0.

  - T1 is dequeued from Q3, which goes back to C=1, and T1 is executed (in practice we would dequeue from the queues whose tasks had just finished, rather than doing a full scan).

- Third pass:

  - T1 finishes, so Q1 and Q3 both return to C=0.

  - T2 is dequeued from Q3 and executed.



Note that deadlocks were not possible. For a deadlock to happen, two tasks A and B would have to appear in reverse order in two queues Q and P: in Q the order would be A, B, and in P it would be B, A. But since a task was pushed to all of its queues simultaneously (there was a lock there…), this could never happen.


Another characteristic was that we could define concurrency levels on some queues. E.g., a resource could be network bandwidth, which does not require that only one task use the network at a time. In such cases, tasks could be dequeued from the queue as long as the counter in the queue was lower than some threshold.
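In the sketch above this is a one-line change to `poll()`, assuming `ResourceQueue` gains an `int limit` field (a limit of 1 restores plain mutual exclusion):

```java
// Drop-in variant of poll() from the earlier Scheduler sketch.
synchronized Task poll() {
    for (ResourceQueue q : queues.values()) {
        // Dequeue while the counter is below this queue's threshold.
        if (q.running >= q.limit || q.tasks.isEmpty()) continue;
        Task t = q.tasks.poll();
        q.running++;
        if (--t.remainingLabels == 0) return t;
    }
    return null;
}
```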


Of course, locking is not the only thing that can block a thread; IO is another. We dealt with this by ending the task at the IO action and having the IO call back into the system when done, which generated a fresh task. For DB access, we actually allowed the threads to block. The reason was that while the thread was blocked, the CPU was used to process the query on the DB side (the DB was on the same machine), so CPU resources were not wasted, and we found the system was always fully utilized. Obviously, one could have used the callback mechanism for DB access as well.
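As an illustration of the callback pattern, reusing the `Scheduler` sketch from above (hypothetical names throughout: `store` and the use of Java’s async `HttpClient` stand in for whatever IO the real system did):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

class IoCallbackExample {
    // Hypothetical persistence step standing in for the real follow-up logic.
    static void store(String body) { /* ... */ }

    // The task ends at the IO call; the completion callback generates a
    // fresh task, labeled with the resource it will modify, so no pooled
    // thread sits blocked while the IO is in flight.
    static void fetchAndStore(Scheduler sched, HttpClient client, URI uri) {
        client.sendAsync(HttpRequest.newBuilder(uri).build(),
                         HttpResponse.BodyHandlers.ofString())
              .thenAccept(resp ->
                  sched.submit(new Scheduler.Task(List.of("res1"),
                                                  () -> store(resp.body()))));
    }
}
```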


One final note is about read-write locks. In our case, a task was treated the same whether it read or wrote a resource. But it is not hard to adapt the mechanism: tasks marked as only reading a resource can be dequeued concurrently, while the counter still prevents any dequeuing when a modifying task is running, and holds back a modifying task while `read` tasks are running.
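For instance, the dequeue check might become something like this, assuming `Task` gains an `isWrite` flag and `ResourceQueue` tracks running writers separately from the total count:

```java
// A possible read/write admission check for the Scheduler sketch.
boolean mayDequeue(ResourceQueue q, Task next) {
    if (next.isWrite) return q.running == 0; // a write needs the queue idle
    return q.runningWriters == 0;            // reads only wait for writes
}
```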


That’s it for this post. I hope it was clear enough.