Apache Beam: The Good, the Bad and the Ugly
I would like to share my experience with Apache Beam on the latest project that I worked on.
My project was an ETL from Salesforce to BigQuery. Since we needed to do this for a lot of tables, and each table can have a few million entries, we decided to go with a big data framework.
The main reason we chose Apache Beam was that it runs on Google Dataflow, and we already had Google Cloud set up in our company. The SaaS model was also very nice for us, and removed the need for devops on the machines.
So what is Apache Beam? Apache Beam grew out of Google Dataflow: Google decided to open source the SDK for Dataflow and create a unified programming model that can run on multiple engines (Dataflow, Spark, Flink…). There are a few major points on which Apache Beam differs from other frameworks like Spark:
Beam is job / pipeline oriented. This means that the whole flow of the job is calculated before it is run in the cloud. This allows for optimizations and for a job-oriented service.
The other major difference (which we will not go into much here) is the idea that you do not change code to switch between streaming and batch mode. Beam has a windowing scheme that covers all scenarios of the SDK, and changing the windows is part of the configuration of the pipeline. For more information on this see the-world-beyond-batch-streaming-101 and the-world-beyond-batch-streaming-102.
The SDK also gives you some nice features. For instance, you can keep state between steps of your transformations (see stateful-processing), and you can write your own aggregators so you don’t have to read the data more than once (see CombineFn).
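To make the aggregator idea concrete, here is a minimal plain-Python sketch of the accumulator contract that a CombineFn follows in Beam's Python SDK (create_accumulator, add_input, merge_accumulators, extract_output). It does not import Beam: the class MeanFn and the helper combine_bundles are hypothetical names for illustration, and a real aggregator would subclass the SDK's CombineFn and be executed by the runner.

```python
class MeanFn:
    """Plain-Python stand-in for a CombineFn that computes a mean."""

    def create_accumulator(self):
        # (running sum, element count) for an empty bundle.
        return (0.0, 0)

    def add_input(self, accumulator, value):
        # Fold one element into a partial result.
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        # Combine partial results produced independently (e.g. by different workers).
        accumulators = list(accumulators)
        return (
            sum(acc[0] for acc in accumulators),
            sum(acc[1] for acc in accumulators),
        )

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine_bundles(fn, bundles):
    """Hypothetical driver: read each bundle once, then merge the partials."""
    partials = []
    for bundle in bundles:
        acc = fn.create_accumulator()
        for value in bundle:
            acc = fn.add_input(acc, value)
        partials.append(acc)
    return fn.extract_output(fn.merge_accumulators(partials))
```

Every element is read exactly once, and only the small (sum, count) pairs are merged, which is the point of writing a custom aggregator.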
The small things that count
Source - Sink
The main design in most data processing is that you have a source and, at the end, a sink. In Beam this is the inherent design; if you need more than one input you have sideInput, and for additional outputs there is sideOutput. Just beware that it is not simple to combine the inputs and outputs as data; the SDK is more code oriented.
Unlike Spark, there is no driver code that you write. In Beam you have a pipeline. All code that is written before the run of the pipeline does not run in the cloud but on the client side. So if you have validations or checks before running the pipeline, they will run in your client code, and only the pipeline itself will run in the cloud.
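The split between client-side code and the pipeline itself can be illustrated with a toy stand-in. This is only a sketch and does not use Beam: SketchPipeline, apply, run and build_and_run are hypothetical names, and the "run" here happens locally, where a real runner would execute the recorded steps in the cloud.

```python
class SketchPipeline:
    """Hypothetical stand-in for a pipeline: steps are recorded, not executed."""

    def __init__(self):
        self.steps = []

    def apply(self, name, fn):
        # Only records the step; nothing is computed yet.
        self.steps.append((name, fn))
        return self

    def run(self, source):
        # The only place where data is touched; in Beam this is the part
        # that a runner would execute remotely.
        data = list(source)
        for _name, fn in self.steps:
            data = [fn(item) for item in data]
        return data


def build_and_run(table_name, rows):
    # Client-side validation: happens before anything is submitted.
    if not table_name:
        raise ValueError("table_name must not be empty")
    pipeline = (
        SketchPipeline()
        .apply("Strip", str.strip)
        .apply("Upper", str.upper)
    )
    return pipeline.run(rows)
```

Because the full list of steps exists before run is called, a runner can inspect and optimize the whole flow up front, which is the job-oriented behaviour described earlier.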
Options for the pipeline (which are part of the runtime of the pipeline) have their own class with built-in validation rules.
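As a rough illustration of options that validate themselves, the sketch below uses Python's standard argparse module rather than Beam's own options class. The function parse_options and the flags --source_table, --output_dataset and --mode are assumptions made up for this example, not options from the project.

```python
import argparse


def parse_options(argv):
    """Parse hypothetical pipeline options, rejecting invalid input up front."""
    parser = argparse.ArgumentParser(description="ETL pipeline options (sketch)")
    # Required options: parsing fails if either one is missing.
    parser.add_argument("--source_table", required=True)
    parser.add_argument("--output_dataset", required=True)
    # Restricted option with a default value.
    parser.add_argument("--mode", choices=["batch", "streaming"], default="batch")
    return parser.parse_args(argv)
```

A call such as parse_options(["--source_table", "Account", "--output_dataset", "crm"]) returns an object with the three fields set, while a missing required flag stops the program on the client before any pipeline is built.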
Google Dataflow has a very nice GUI for monitoring your pipeline.
Beam even has a wrapper class, the composite transform, that groups actions in the GUI.
Although the SDK is a very nice abstraction, in practice there are not many implementations for different sources.
Since the DAG is computed before execution, you cannot have any dynamic code for parts of the pipeline. This means that you must know in advance what your source and sink are. You cannot save to different folders in Google Storage based on the input. If you are using Avro, your schema must be known in advance, and you cannot use a dynamically generated schema as you run the pipeline.
When working with CSVs that are split into parts, only the first part has the header row, so in the code you can get parts with the header and parts without.
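One way to cope with this is to compare the first row of every part against the known header and drop it when it matches. The sketch below assumes the header is known in advance and that each part arrives as text; the function name rows_without_header is hypothetical, and this is plain Python rather than a Beam transform.

```python
import csv
import io


def rows_without_header(part_text, header):
    """Yield data rows from one CSV part, skipping the header if this part has it."""
    reader = csv.reader(io.StringIO(part_text, newline=""))
    for index, row in enumerate(reader):
        if index == 0 and row == header:
            continue  # this part happens to start with the header row
        if row:  # ignore blank lines
            yield row
```

The same function can then be applied to every part, whether or not it is the first one.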
Beam has fusion optimizations that sometimes cause problems, and the workarounds are not always simple.
Partitioning is supported, but like everything in the pipeline the number of partitions must be known in advance (though it can be a runtime input parameter).
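A small sketch of what a fixed partition count looks like in practice. In Beam's Python SDK the Partition transform takes a function of (element, num_partitions) that returns a partition index; the code below mirrors that shape in plain Python without importing Beam. The names partition_fn and split_into_partitions are hypothetical, and hashing with zlib.crc32 is just one possible way to assign elements.

```python
import zlib


def partition_fn(element, num_partitions):
    """Map a string element to a stable index in [0, num_partitions)."""
    return zlib.crc32(element.encode("utf-8")) % num_partitions


def split_into_partitions(elements, num_partitions):
    """Hypothetical local driver: the bucket count is fixed before any data is seen."""
    buckets = [[] for _ in range(num_partitions)]
    for element in elements:
        buckets[partition_fn(element, num_partitions)].append(element)
    return buckets
```

The number of buckets is decided when the pipeline is built; it can come from an input parameter, but it cannot depend on the data flowing through.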
Apache Beam is very promising. It has some features (like windowing, state…) that you do not find in other frameworks. An SDK spec for SQL is also currently in the works, so we hope that this will broaden the use of Apache Beam.
Overall, using Google Dataflow as a service is very good, and as usual competition is good for everyone.