Adventures With Airflow
Big data and big scale bring the need to build scalable data pipelines and workflows.
There are a lot of different tools and platforms to achieve this goal, such as Luigi (open sourced by Spotify), Pinball (open sourced by Pinterest), Azkaban and more.
Today we’ll take a look at the up-and-coming star - Apache Airflow.
After reading this article, you’ll have an understanding of the basic concepts of Airflow, how to define a workflow and what to consider before choosing Airflow for your project.
In the Additional Reading section, you’ll find some good resources to get you started, as well as in-depth comparisons to Luigi and Pinball.
Airflow was started in October 2014 by Maxime Beauchemin at Airbnb, and joined the Apache Software Foundation’s incubation program in March 2016.
What is Airflow?
“Airflow is a platform to programmatically author, schedule and monitor workflows.”
Airflow in 20 seconds:
In Airflow, you describe workflows as DAGs - directed acyclic graphs. Each DAG is made up of tasks.
A task can be dependent on the success of other tasks in order to run.
The airflow scheduler executes these tasks on an array of workers while following the specified dependencies.
Airflow has a nice UI that displays your DAGs and tasks, allowing you to monitor your runs and access each task’s logs with a press of a button.
Tasks and DAGs are defined using Python scripts.
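The scheduler’s core behavior - run a task only once everything it depends on has succeeded - can be sketched in plain Python, with no Airflow required (the task names and dependency map here are illustrative):

```python
# Tiny illustration of "run tasks in dependency order" - plain Python, no Airflow.
# Each task maps to the set of tasks it depends on (its "upstream" tasks).
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run_order(deps):
    """Return the task names in an order that respects the dependencies."""
    done, order = set(), []
    while len(done) < len(deps):
        # A task is runnable once all of its upstream tasks have finished.
        ready = [t for t, up in deps.items() if t not in done and up <= done]
        if not ready:
            raise ValueError("cycle detected - not a DAG")
        for t in sorted(ready):
            order.append(t)
            done.add(t)
    return order

print(run_order(deps))  # ['extract', 'transform', 'load', 'report']
```

Airflow does far more than this (retries, scheduling, distribution across workers), but the dependency-following idea is the same.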
At this point it’s all a bit abstract and confusing. You might feel a bit lost, right?
So let’s break it down!
Airflow DAG example
Let’s look at the following DAG:
Here, we have a single DAG with six tasks, ordered to run from left to right:
1) Run the first task (named run_this_first)
2) Decide which path to take, based on a condition evaluated in the task named branching
3) If branching chose the first path, execute branch_a, then follow_branch_a.
If branching chose the other path, branch_false is executed.
4) Either way, the last task to run is join.
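The steps above can be sketched as a DAG file. This is a hedged sketch, not the exact code behind the diagram: the task names come from the diagram, the branch condition is a hypothetical stand-in, and DummyOperator is used for tasks whose real work isn’t shown.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG('branch_example', start_date=datetime(2018, 1, 1),
          schedule_interval=timedelta(1))

run_this_first = DummyOperator(task_id='run_this_first', dag=dag)

def some_condition():
    return True  # hypothetical: replace with a real check

def choose_branch():
    # A BranchPythonOperator callable returns the task_id of the path to follow.
    return 'branch_a' if some_condition() else 'branch_false'

branching = BranchPythonOperator(task_id='branching',
                                 python_callable=choose_branch, dag=dag)

branch_a = DummyOperator(task_id='branch_a', dag=dag)
follow_branch_a = DummyOperator(task_id='follow_branch_a', dag=dag)
branch_false = DummyOperator(task_id='branch_false', dag=dag)

# 'join' runs either way, so it must tolerate the skipped branch.
join = DummyOperator(task_id='join', trigger_rule='one_success', dag=dag)

run_this_first.set_downstream(branching)
branching.set_downstream(branch_a)
branch_a.set_downstream(follow_branch_a)
follow_branch_a.set_downstream(join)
branching.set_downstream(branch_false)
branch_false.set_downstream(join)
```

Note the trigger_rule on join: by default a task waits for all its upstream tasks to succeed, which would never happen here since one branch is always skipped.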
A task is created by instantiating an Airflow Operator class (or subclass).
Airflow has many (many) built-in Operators you can use out of the box - including BashOperator (runs a simple Bash command), EmailOperator (sends an email), HdfsSensor (waits for a file or folder to land in HDFS), HiveOperator (executes HQL code against a specific Hive database) and… you get the idea.
Right now, I bet you’re wondering - how do you define a DAG? How do you define a task?
Some code snippets
dag = DAG('tutorial', default_args=default_args, schedule_interval=timedelta(1))
- ‘tutorial’ is the dag_id, which is a unique identifier for our DAG
- default_args is the default argument dictionary, shared by all the tasks of the DAG.
- schedule_interval is the interval at which to run the DAG (it can also be a cron expression). Here it is an interval of one day.
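The default_args dictionary itself is ordinary Python. The keys below are standard Airflow task arguments, but the values are purely illustrative:

```python
from datetime import datetime, timedelta

# Illustrative defaults shared by every task in the DAG; each task can
# still override any of these individually.
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),  # first date the DAG runs for
    'retries': 1,                        # retry a failed task once
    'retry_delay': timedelta(minutes=5), # wait between retries
}
```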
t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
- t1 is of type BashOperator - there are many (many) types of operators
- task_id - unique identifier for the task
- bash_command - the bash command to run in this BashOperator task
- dag - the dag to which this task belongs
Task dependency definition
Let’s say we have three tasks - t1, t2, t3.
We want to build a dag of t1 -> t2 -> t3.
Here’s how to do it:
t2 will depend on t1 running successfully in order to run (same as t1.set_downstream(t2))
t3 will depend on t2 running successfully in order to run (same as t3.set_upstream(t2))
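In code, a minimal sketch of that chain might look like this (the DAG name and bash commands are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('chain_example', start_date=datetime(2018, 1, 1),
          schedule_interval=timedelta(1))

t1 = BashOperator(task_id='t1', bash_command='echo 1', dag=dag)
t2 = BashOperator(task_id='t2', bash_command='echo 2', dag=dag)
t3 = BashOperator(task_id='t3', bash_command='echo 3', dag=dag)

# Wire up t1 -> t2 -> t3; both lines say "the upstream task must
# succeed before the downstream task runs".
t2.set_upstream(t1)  # same as t1.set_downstream(t2)
t3.set_upstream(t2)  # same as t2.set_downstream(t3)
```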
So why should you use Airflow?
Because it is:
Dynamic - pipelines are configured as Python code, allowing you to instantiate pipelines dynamically.
Extensible - You can define your own operators, hooks and executors, and extend the library. You can reuse tasks among many DAGs.
Elegant and flexible - pipelines are lean and explicit. You can parameterize your scripts and use the Jinja templating engine.
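For example, Airflow renders Jinja templates in templated operator fields before the task runs. Here’s a hedged sketch using the built-in {{ ds }} macro (the execution date as YYYY-MM-DD) and user-supplied params; the DAG name and table name are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('template_example', start_date=datetime(2018, 1, 1),
          schedule_interval=timedelta(1))

# {{ ds }} is a built-in Airflow macro; params passes your own values
# into the template.
templated = BashOperator(
    task_id='templated',
    bash_command='echo "run date: {{ ds }}, table: {{ params.table }}"',
    params={'table': 'events'},  # 'events' is a hypothetical table name
    dag=dag,
)
```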
Scalable - Airflow can scale to an arbitrary number of workers.
But also, when dealing with Airflow, keep in mind:
- There’s a steep learning curve - a lot of operators, executors and hooks to learn.
- Information is hard to come by:
  - Documentation is lacking
  - Not a lot of StackOverflow Q&As
- It feels like a beta product, with a lot of bugs and quirks (it is still an incubation project).
- You must know Python.
After working with Airflow, I discovered a very powerful tool.
With it, I was able to build highly complex flows with many dependencies.
Since workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
I think that the vast number of out-of-the-box operators, the built-in scheduler and the nice UI are good reasons to consider Airflow.
However, the steep learning curve, sparse information and many bugs and quirks are things to consider before adopting it, since they make the on-boarding process quite lengthy.
Thanks for reading!
Additional Reading
- Airflow project homepage: https://airflow.apache.org/index.html
- Luigi vs Airflow vs Pinball comparison: http://bytepawn.com/luigi-airflow-pinball.html
- In-depth Luigi vs Airflow: https://towardsdatascience.com/why-quizlet-chose-apache-airflow-for-executing-data-workflows-3f97d40e9571
- A second opinion: http://blog.paracode.com/2017/10/08/airflow/
- Project popularity comparison with Github stars: http://www.timqian.com/star-history/#apache/incubator-airflow&spotify/luigi&apache/oozie&azkaban/azkaban