Architecture of a resource management server based on Erlang's OTP Design – building an application server

In this post I will describe the architecture of a project I once wrote that has a lot in
common with Erlang's OTP Design Principles.

Introduction
 
A common approach to exposing a service to the web is to run it under an application/web server such as Tomcat or JBoss. If there are many different services that share common behavior (such as rate limiting, statistics and more), you either end up with a monster server that gathers lots of services, or with multiple application servers running on different ports that share some common jars but also duplicate code. OTP Design Principles can be viewed as an umbrella of services and features – a resource management server that can update and upgrade services, verify that the system is healthy and working well, provide statistics about any service/worker, etc.
The update/add/delete approach - What would you do if you wanted to update the data in these services at runtime on production systems? Would you cache the old data while a script builds the new version and then swap them? Or would you look inside the data and change specific records? Every service behaves differently and has its own data, so you would need to implement an update mechanism for every service. The OTP design overcomes these problems with a replacement approach: it simply loads a new replacement JVM process that holds the new data version, and may even load a new code version as well.
Keeping services alive, healthy and working properly – What if a service is damaged and runs into a loop that exhausts the machine, or worse, has a bug that crashes the whole JVM together with all the other services running in it? With the OTP design you can monitor all of the processes and verify not only that each service is available but also that it works well. In addition, every service runs in a separate JVM, so if one crashes it won't take down the whole system, and a replacement will be launched.
Language selection flexibility - what if you want to write a service in Ruby? The OTP design architecture allows exactly that: a service/worker can be written in any language.
Development effort/time – OTP helps you focus on the business logic alone. When writing a new worker you don't have to be concerned with exposing the web service, etc.

 

High level design

 
This design is based on separation of concerns:
  1. Workers – every worker acts as a service; for example, a Lucene-based service that returns results based on its index. The worker receives its jobs, processes them and returns the results to the Client-Server Supervisor.
  2. Supervisors can be divided into two kinds: 
    1. Client-Server Supervisor – handles the client requests/responses/continuations, delivers a unique job to every worker, and provides a variety of services such as statistics, rate limiting, asynchronous and synchronous request handling, priority jobs, health check/status, etc.
    2. Workers Supervisor executor – verifies that every worker works well and is responsible for replacing workers as needed. It can also take on another role: downloading new versions (data or code) and replacing workers at runtime.

 

Diagram #1 demonstrates the communication between these modules, not the flow sequence, which is demonstrated in the PowerPoint below:


 

Diagram 1 – High level architecture

 

 

Job request handling - Client-Server supervisor & its Workers
 
Workers
Every worker has a generic behavior (GenericWorker): it signs up with its supervisors, periodically creates HTTP requests, receives jobs to execute, notifies about failures, returns the job results, works in multi-threaded mode, stops processing an existing request, etc.
When writing a new service, the developer mainly focuses on the business logic and usually doesn't have to worry about the generic behavior.
The worker's HTTP request contains the results of the previous job in its body. The next job to process is delivered to the worker in the HTTP response.
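This request/response exchange can be sketched as a long-poll loop. The following is a minimal illustration only, not the actual GenericWorker code; the class and method names, the endpoint, and the 20-second timeout are assumptions for the example:

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the worker's long-poll cycle: each POST carries the
// previous job's result in its body, and the HTTP response body is the next job.
public class GenericWorker {
    private final URL supervisorUrl;

    public GenericWorker(String supervisorUrl) throws IOException {
        this.supervisorUrl = new URL(supervisorUrl);
    }

    // Executes one poll cycle: send the previous result, receive the next job.
    public String exchange(String previousResult) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) supervisorUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setReadTimeout(20_000); // illustrative client-side read timeout
        try (OutputStream out = conn.getOutputStream()) {
            out.write(previousResult.getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder job = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) job.append(line);
            return job.toString(); // the next job to process
        }
    }
}
```

The single HTTP round trip doing double duty (deliver result, fetch job) is what keeps the worker thin: it never has to expose an HTTP endpoint of its own.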
 
The Client-Server Supervisor has three different handlers: the first deals with the workers, the second with the client requests, and the third with status/statistics of all the queues/workers/services, including detailed information about every worker.
The main functionality of the Client-Server is to map every kind of client service request to the appropriate service/worker. The server keeps no persistent state: all queues are held in memory, two per service – a clientQ that contains all waiting client requests, and a workerQ that contains all the workers waiting to receive jobs for that service. All communication between client and server, and between workers and server, is based on the standard HTTP protocol.
Flow description - when a client request arrives at the Client-Server Supervisor, a continuation is initialized for it. If the service currently has available workers waiting in the workerQ, a "good" worker is taken from the queue and the job is created. Otherwise, if there are no available workers in the queue (they are all working on previous jobs), the request is added to the clientQ; when a worker comes back with the result of its previous job, the server first sends that result to the previous client, and the worker then takes the next job from the clientQ. The PowerPoint below demonstrates the request flow.
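The matching step between the two queues can be sketched as follows. This is an illustrative model only (the class and method names are invented for the example), using plain strings in place of real request and worker objects:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the per-service matching step: a clientQ of waiting
// requests and a workerQ of idle workers. A new arrival on either side is
// paired immediately if the opposite queue is non-empty, otherwise parked.
public class Dispatcher {
    private final Map<String, Queue<String>> clientQ = new ConcurrentHashMap<>();
    private final Map<String, Queue<String>> workerQ = new ConcurrentHashMap<>();

    // Returns the worker assigned to this request, or null if the request was queued.
    public synchronized String onClientRequest(String service, String request) {
        String worker = queue(workerQ, service).poll();
        if (worker == null) {
            queue(clientQ, service).add(request); // no idle worker: park the request
        }
        return worker;
    }

    // Returns the next job for this worker, or null if the worker was queued.
    public synchronized String onWorkerArrival(String service, String worker) {
        String request = queue(clientQ, service).poll();
        if (request == null) {
            queue(workerQ, service).add(worker); // no pending request: park the worker
        }
        return request;
    }

    private Queue<String> queue(Map<String, Queue<String>> m, String service) {
        return m.computeIfAbsent(service, s -> new ConcurrentLinkedQueue<>());
    }
}
```

The symmetry is the key point: at any moment, for a given service, at most one of the two queues is non-empty.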
 
The Workers Supervisor executor's main role is resource management: controlling the number of workers, starting/stopping workers, checking that they work well and replacing them when necessary, downloading the latest data/code version for each service/worker, preparing a new version and replacing the old worker, reverting to an old version, and keeping the latest X good versions (a version can be data or code, i.e. updating or upgrading a service).
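The watchdog part of that role can be sketched as a periodic health check that kills and respawns unresponsive workers. This is an assumed shape, not the project's actual code; the `WorkerHandle` interface and all names here are invented for illustration:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the Workers Supervisor's watchdog: periodically probe
// each worker process and replace any that is dead or unresponsive.
public class WorkersSupervisor {
    public interface WorkerHandle {
        boolean isHealthy();    // e.g. a status ping that must answer in time
        void kill();            // terminate the damaged JVM
        WorkerHandle respawn(); // start a fresh process, possibly a newer version
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // One pass over the fleet: replace every unhealthy worker in place.
    public void checkOnce(WorkerHandle[] workers) {
        for (int i = 0; i < workers.length; i++) {
            if (!workers[i].isHealthy()) {
                workers[i].kill();
                workers[i] = workers[i].respawn();
            }
        }
    }

    public void watch(WorkerHandle[] workers, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> checkOnce(workers),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public void shutdown() { scheduler.shutdownNow(); }
}
```

Because each worker is its own JVM, `kill()` can be as blunt as destroying the process; the supervisor never shares a heap with the thing it is policing.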
 
 

 

 
Disadvantages / problems of this approach:
1.      Many connections can be created on the server - every worker opens multiple connections to the Client-Server supervisor. To avoid creating so many connections you can enable HTTP keep-alive and reuse connections. In addition you might need to tune the TCP configuration of the machine (http://publib.boulder.ibm.com/infocenter/cmgmt/v8r3m0/index.jsp?topic=/com.ibm.eclient.doc/trs40019.htm, http://smallvoid.com/article/winnt-tcpip-max-limit.html).
2.      This design is delicate with regard to timing/clocks: timeouts on the server side (continuations) vs. client-side read/connect timeouts.
The design in which a worker arrives at the Client-Server supervisor and hangs there until it receives a job is problematic. For example:
say you set a 20-second read timeout on the worker's client side, and a client request arrives after 19 seconds, while the worker has already been hanging in the Client-Server for those 19 seconds. The server now has only 1 second left to write the job to this worker. Without handling this kind of edge scenario you can lose requests. One way to deal with the issue is to release a worker when there is not enough time left to deliver a job; on the other hand, I would not recommend an endless (0) readTimeout or connectTimeout either.
3.      It takes time to build such an application server so that resource management works well, but once the supervisors are written, developing new services/workers is very easy and fast.
4.      Latency - latency suffers a bit because every byte is written and read twice: on the server we read the input stream and write to the worker's output stream; on the worker's side we read it again, and when we have results we write them back to the server. Latency increases by only a few milliseconds, depending on the job's content length.
5.      Separating the supervisors has advantages but also some disadvantages, such as dealing with edge communication scenarios in which a supervisor needs to make decisions.
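The timing hazard in point 2 suggests a simple guard: before handing a job to a parked worker, check how long it has been waiting against its known client-side read timeout. This sketch is an assumption about how such a guard could look; the class name and the 2-second safety margin are invented for the example:

```java
// Hypothetical guard for the edge case above: release a parked worker instead
// of delivering a job when too little of its read-timeout window remains.
public class TimeoutGuard {
    private final long workerReadTimeoutMs; // the worker's client-side read timeout
    private final long safetyMarginMs;      // minimum time needed to write the job safely

    public TimeoutGuard(long workerReadTimeoutMs, long safetyMarginMs) {
        this.workerReadTimeoutMs = workerReadTimeoutMs;
        this.safetyMarginMs = safetyMarginMs;
    }

    // True if it is still safe to deliver a job to a worker parked since parkedAtMs.
    public boolean safeToDeliver(long parkedAtMs, long nowMs) {
        long remaining = workerReadTimeoutMs - (nowMs - parkedAtMs);
        return remaining >= safetyMarginMs;
    }
}
```

With a 20-second read timeout and a 2-second margin, a worker parked for 19 seconds would be released empty-handed and would immediately reconnect, rather than risk timing out mid-delivery and losing the request.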
 
What are the benefits of this approach?
 
  1. Development time decreases - this provides a faster way to develop a new service. There is no point in wrapping every new service in a web/application server and implementing all of the above features for every service. The idea is to keep the service/worker thin (just the service core itself). 
  2. Every registered worker automatically receives sub-services from the client-server supervisor.
  3. Ability to make data and code updates, handled by the supervisor – workers can be replaced with newer versions.
  4. Rate limiting and resource management operations are given for free for all the services/workers.
  5. Watch over the workers/services and verify that they are up and working well.
  6. Information is exposed via an API, such as: last failed job time, number of successful jobs, total number of failed jobs, and how many client requests are waiting in the queue at a given moment. 
  7. Priority queue mechanism – the client-server supervisor distinguishes between normal and higher-priority requests in the client queue.
  8. Load/kill workers based on statistics – based on daily statistics, the supervisor can decide in advance when to load more workers. 
  9. Development language selection flexibility.
  10. Minimized risk if a crash occurs – workers sign up only once and don't need to be restarted if one of their supervisors is restarted.
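The priority distinction mentioned in benefit 7 can be sketched with a priority queue that orders by priority first and arrival order second. The class and field names are invented for illustration; the post does not describe the actual implementation:

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of the priority client queue: high-priority requests are
// dequeued before normal ones; within the same priority, arrival order (FIFO)
// is preserved via a sequence number.
public class PriorityClientQueue {
    public static final int NORMAL = 0, HIGH = 1;

    static final class Request {
        final String payload;
        final int priority;
        final long seq;
        Request(String payload, int priority, long seq) {
            this.payload = payload; this.priority = priority; this.seq = seq;
        }
    }

    private long counter = 0;
    private final PriorityBlockingQueue<Request> q = new PriorityBlockingQueue<>(
            11,
            Comparator.comparingInt((Request r) -> -r.priority) // higher priority first
                      .thenComparingLong(r -> r.seq));          // then FIFO

    public synchronized void add(String payload, int priority) {
        q.add(new Request(payload, priority, counter++));
    }

    public String poll() {
        Request r = q.poll();
        return r == null ? null : r.payload;
    }
}
```

The sequence number matters: a plain priority queue gives no ordering guarantee among equal-priority elements, so without it two normal requests could be served out of arrival order.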

 
