A benchmarking tool for Streaming systems

Yahoo! built a benchmark tool to compare different open source stream processing systems. They open-sourced it on GitHub for anyone to use in their own environment: github/yahoo/streaming-benchmark

Currently this benchmark supports three streaming systems:
- Apache Storm
- Apache Flink
- Apache Spark

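Storm and Flink are both open source, real-time streaming systems that process each event as it arrives. Spark Streaming is not real-time in that sense; it is micro-batch based, meaning it collects incoming events into small batches and processes each batch as a unit.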
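To make the difference concrete, here is a minimal sketch in Python of how long an event waits before it is processed under each model. It is a toy model, not how Storm, Flink or Spark are actually implemented: the function names, the integer millisecond timestamps and the assumption that processing itself takes zero time are all mine, for illustration only.

```python
def per_event_wait_ms(arrival_times_ms):
    """Record-at-a-time model: each event is handled as soon as it arrives,
    so (ignoring processing cost) it waits 0 ms."""
    return [0 for _ in arrival_times_ms]


def micro_batch_wait_ms(arrival_times_ms, batch_interval_ms):
    """Micro-batch model: an event waits until the end of the batch
    interval it arrived in before the batch is processed."""
    waits = []
    for t in arrival_times_ms:
        # End of the interval that contains t (integer arithmetic only).
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        waits.append(batch_end - t)
    return waits


# Five events arriving within the same one-second batch interval.
arrivals = [0, 250, 500, 750, 999]
streaming_waits = per_event_wait_ms(arrivals)
batch_waits = micro_batch_wait_ms(arrivals, batch_interval_ms=1000)
```

In this toy model an event that arrives right at the start of a batch interval waits the full interval, which is one way to see why a micro-batch system tends to have higher latency than a record-at-a-time system.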

Both Apache Storm and Spark Streaming have become very popular in recent streaming projects. At Yahoo!, Storm was chosen as the main platform for many projects, so it is interesting to see what they found in this benchmark.

Results of the comparison done by Yahoo!

The benchmark compares the same pipeline implemented in each of the streaming systems. The pipeline reads events from Kafka, deserializes the JSON events, and writes the events to Redis. Additional processing is also done on the events along the way.
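The shape of such a pipeline can be sketched in a few lines of Python. This is only an illustration, not the benchmark's code: a plain list stands in for the Kafka topic and a plain dict stands in for Redis, so it runs without either service. The field names (`ad_id`, `event_type`, `event_time`), the filter on "view" events and the 10-second window are assumptions I chose to represent the "additional processing" step.

```python
import json
from collections import defaultdict


def run_pipeline(raw_messages, store, window_ms=10_000):
    """Deserialize JSON events, filter them, count them per (ad, window),
    and write the counts to `store` (a dict standing in for Redis)."""
    counts = defaultdict(int)
    for raw in raw_messages:
        event = json.loads(raw)                 # deserialize the JSON event
        if event.get("event_type") != "view":   # hypothetical filter step
            continue
        # Align the event time to the start of its window.
        window_start = (event["event_time"] // window_ms) * window_ms
        counts[(event["ad_id"], window_start)] += 1
    for (ad_id, window_start), n in counts.items():
        store[f"{ad_id}:{window_start}"] = n    # stand-in for a Redis write
    return store


# A list of JSON strings standing in for messages read from Kafka.
messages = [
    json.dumps({"ad_id": "a1", "event_type": "view", "event_time": 1000}),
    json.dumps({"ad_id": "a1", "event_type": "view", "event_time": 9000}),
    json.dumps({"ad_id": "a1", "event_type": "click", "event_time": 9500}),
    json.dumps({"ad_id": "a1", "event_type": "view", "event_time": 12000}),
    json.dumps({"ad_id": "a2", "event_type": "view", "event_time": 500}),
]
result = run_pipeline(messages, store={})
```

In a real deployment the loop body would live inside a Storm bolt, a Flink operator or a Spark Streaming transformation, and the time at which each window's count reaches the store is what a latency measurement would look at.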

Results can be found in this article: Benchmarking Streaming Computation Engines at Yahoo!.

Figure: Percentage comparison of Last Window Update Time

Quoting the bottom line of Yahoo!'s findings:

"The throughput vs latency graph for the various systems is maybe the most revealing, as it summarizes our findings with this benchmark. Flink and Storm have very similar performance, and Spark Streaming, while it has much higher latency, is expected to be able to handle much higher throughput."

Conclusion

Storm became very popular starting about 4 years ago, as it was the first real open source infrastructure for streaming that was also scalable. Spark Streaming brought new hope as a system better suited to high throughput.

So if you are looking for a very fast real-time system, Storm may be your choice. If you are looking for a system that can sustain high throughput and you can accept higher latency in exchange, you may prefer Spark Streaming.