Parallel Batch Jobs in our application using grid computing and SpringBatch
I just wanted to share with you one of the design issues Shalom and I worked on in DT: how do we parallelize batch jobs in our application?
In DT we have been using SpringBatch as our framework for executing and managing batch jobs, and we are quite happy with it. However, SpringBatch in its current version lacks support for parallelism: whenever a job is executed, all the work for all its steps is done by a single thread. SpringBatch 2.0 brings support for parallel jobs by letting a Step be "split and aggregated" (SpringBatch jargon for map/reduce).
Nevertheless, SpringBatch is **NOT** a grid computing framework. It does **NOT** address many of the complicated issues of parallelism, such as node discovery, node topology, job collision, fail-over, node management, job stealing and many other grid computing concerns.
We decided to take 2 actions:
Use the SpringBatch 2.0 interfaces by upgrading to SpringBatch 2.0 M3.
Implement these interfaces in our application by delegating the split and reduce API from SpringBatch to GridGain.
The infrastructure is now powered by GridGain ( http://gridgain.com/ ) to support parallelism for SpringBatch jobs and for other tasks such as Match&Merge, crawling, etc.
GridGain will boost our system's scalability and enable us to run tasks in parallel and to coordinate and manage them (through JMX). This parallelism is achieved using the map/reduce paradigm (at the heart of GridGain), wrapped by the SpringBatch partition interfaces package. SpringBatch thus splits the step into sub-steps and aggregates them at the end, but under the covers it delegates the work to GridGain. GridGain maps GridGain jobs to the available nodes and reduces the results at the end, handing them back to SpringBatch as the step outcome.
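To make the split/aggregate flow above a bit more concrete, here is a minimal sketch in plain Java. It is not our implementation and it uses neither the SpringBatch nor the GridGain API: the class and method names are hypothetical, and a local thread pool merely stands in for the grid nodes. The step is split into partitions, each partition is mapped to a worker, and the partial results are reduced into a single step outcome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from SpringBatch or GridGain.
class SplitAggregateSketch {

    // Split: cut the item range [0, itemCount) into at most gridSize contiguous partitions.
    static List<int[]> split(int itemCount, int gridSize) {
        List<int[]> partitions = new ArrayList<>();
        int chunk = (itemCount + gridSize - 1) / gridSize; // ceiling division
        for (int start = 0; start < itemCount; start += chunk) {
            partitions.add(new int[] {start, Math.min(start + chunk, itemCount)});
        }
        return partitions;
    }

    // Map + reduce: run one task per partition, then sum the per-partition counts.
    static int runStep(int itemCount, int gridSize) throws Exception {
        // A local thread pool is an assumption standing in for the grid nodes.
        ExecutorService nodes = Executors.newFixedThreadPool(gridSize);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int[] p : split(itemCount, gridSize)) {
                // "Process" the partition; this sketch only counts its items.
                tasks.add(() -> p[1] - p[0]);
            }
            int processed = 0;
            for (Future<Integer> partial : nodes.invokeAll(tasks)) {
                processed += partial.get(); // aggregate into a single step outcome
            }
            return processed;
        } finally {
            nodes.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runStep(1000, 4)); // prints 1000
    }
}
```

In the real setup the worker side would be GridGain jobs running on remote nodes rather than threads in one JVM, but the shape of the flow (split, map, reduce, return the step outcome) is the same idea.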
What does it mean? It means we can easily execute jobs/tasks/calculations across multiple JVMs (on the same computer or on different computers) or within a single JVM, in many environments: JUnit, command line, JBoss, integration tests. The parallelism is configured using Spring. This includes discovery of other nodes, node topology, collision configuration, job stealing, monitoring through JMX and the many other features GridGain provides, with many service provider interface implementations already supplied by the framework.
The introduction of GridGain will enable us to control many aspects of parallelism (number of sub-threads, number of processes, collision detection, etc.) in the POC and to fine-tune the infrastructure, again without changing the code. It will help us scale CPU-bound tasks (like Match&Merge) and IO-bound tasks (like email refinery) in a common and convenient way.
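As a small illustration of tuning "without changing the code", the sketch below reads the degree of parallelism from external configuration. It is only an assumption-laden stand-in: in our setup the configuration lives in Spring, whereas here a JVM system property is used to keep the example self-contained, and the property name `batch.grid.size` is invented for illustration.

```java
// Hypothetical sketch: the property name "batch.grid.size" is invented for illustration
// and is not a SpringBatch or GridGain setting.
class ParallelismSettings {

    // Returns the configured number of partitions, falling back to the CPU count.
    static int gridSize() {
        int fallback = Runtime.getRuntime().availableProcessors();
        int configured = Integer.getInteger("batch.grid.size", fallback);
        return Math.max(1, configured); // never allow fewer than one partition
    }

    public static void main(String[] args) {
        System.out.println("grid size = " + gridSize());
    }
}
```

Running the same code with `-Dbatch.grid.size=8` changes the degree of parallelism without recompiling, which is the kind of tuning we want to do in the POC.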