Data Processing - Should I Stream or Should I Batch
In today’s big data world, we need to process very large volumes of data. There are some very good frameworks for data processing. To name a few: Apache Hadoop for batch processing, Apache Storm for stream processing, Apache Spark, which can do both batch and streaming, and others.
But many companies that want to move into the big data world and scale out start to ask: what do I really need, batch or streaming?
The answer may seem very simple: if you need (near) real-time freshness, go with streaming. If latency is not an issue, you may use batch processing.
But from working with several companies, I realized that it is not that simple. It is not always clear what the requirements are. Moreover, companies should be prepared for a future in which the requirements may change.
A new way to look at Data Processing
When I go to work, I like to use the train. I know exactly when I will arrive and I have plenty of time to think during the trip.
Other people prefer to have their car close to them, so they do not use the train. Having your own car allows you to hit the road the moment you are ready and to get there faster, although sometimes you will get to work later than if you had taken the train…
Now, let us look at this preference not from an individual point of view, but through the eyes of the minister of transportation. It is quite obvious that the minister of transportation prefers that everyone use the train instead of private cars. Why? Because the infrastructure for the train is much cheaper than what is needed for a car for every person. When we use cars we need many roads and a lot of gasoline, and there are many more car crashes and much more pollution…
However, it is still important to have roads for cars, as sometimes it is essential to be able to get there faster than the train, or to travel at times when the train does not run.
Now this is exactly the same with batch vs. stream processing of data! The batch is like the train: it uses cheaper infrastructure than streaming. You can predict much better when the batch will be complete, although it may take more time. When using streaming, you think you get a faster response, but if there is a data spike you may run into a “traffic jam” that will cause you to be late!
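To make the “traffic jam” concrete, here is a toy sketch in Python. It is not taken from any framework: the function name, the arrival numbers and the fixed capacity are all invented for illustration, and it assumes a streaming consumer that can handle a fixed number of records per tick while anything beyond that waits in a queue.

```python
def simulate_backlog(arrivals, capacity_per_tick):
    """Return the queue backlog after each tick for a fixed-capacity consumer.

    arrivals: number of records arriving in each tick (hypothetical numbers).
    capacity_per_tick: records the consumer can process per tick (assumed fixed).
    """
    backlog = 0
    history = []
    for arrived in arrivals:
        # Records that cannot be processed in this tick stay in the queue.
        backlog = max(0, backlog + arrived - capacity_per_tick)
        history.append(backlog)
    return history


# Three quiet ticks, a two-tick spike, then three quiet ticks again.
arrivals = [80, 80, 80, 300, 300, 80, 80, 80]
print(simulate_backlog(arrivals, capacity_per_tick=100))
# prints [0, 0, 0, 200, 400, 380, 360, 340]
```

With only 20 records of spare capacity per tick, the backlog left by the two-tick spike drains very slowly, so records arriving long after the spike is over are still delayed. That is the jam.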
So which approach to data processing is preferred?
To continue with our parable above, let us say that the software architect is the minister of transportation.
He will always prefer batch processing over streaming when feasible! It is cheaper, it is more predictable and it is easier to manage. Streaming, then, is a luxury, like driving your own car. But sometimes you have to get your data faster than the batch process allows. Then, and only then, should you use streaming!
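The rule of thumb above can be written down as a small Python sketch. The function and parameter names are hypothetical, and the worst-case estimate (one full batch interval plus the batch runtime) is a simplifying assumption of mine, not a formula from any framework.

```python
from datetime import timedelta


def choose_processing_mode(required_freshness, batch_interval, batch_runtime):
    """Prefer batch; fall back to streaming only when batch is too slow.

    required_freshness: maximum acceptable age of the data when it is used.
    batch_interval: time between the starts of two consecutive batch runs.
    batch_runtime: how long a single batch run takes to complete.
    """
    # Assumption: a record that arrives just after a run starts waits a full
    # interval for the next run, and then for that run to finish.
    worst_case_batch_delay = batch_interval + batch_runtime
    if required_freshness >= worst_case_batch_delay:
        return "batch"
    return "streaming"


# A daily report is well served by an hourly batch that takes ten minutes.
print(choose_processing_mode(
    required_freshness=timedelta(hours=24),
    batch_interval=timedelta(hours=1),
    batch_runtime=timedelta(minutes=10),
))  # prints batch

# An alert that must fire within five seconds is not.
print(choose_processing_mode(
    required_freshness=timedelta(seconds=5),
    batch_interval=timedelta(hours=1),
    batch_runtime=timedelta(minutes=10),
))  # prints streaming
```

Since requirements may change, it is worth re-running a check like this whenever the freshness requirement or the batch schedule changes.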