Airflow is “a platform to programmatically author, schedule and monitor workflows”. It’s the new kid on the block when it comes to formalizing workflows, a.k.a. pipelines. This post is for you if you’re trying to get a decent Airflow environment set up.

There is a whole bunch of configuration options within Airflow. The base configuration is defined in airflow.cfg, an INI-style text file. However, any option can be overridden with a correspondingly named environment variable. This is particularly handy for sensitive information (think passwords) and for stuff that tends to change between environments (think hostnames).
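To illustrate the naming convention: an option <key> in the [<section>] block of airflow.cfg maps to an environment variable called AIRFLOW__<SECTION>__<KEY>, all upper-case, with a double underscore on each side of the section name. The parallelism option under [core] is just used as a stand-in here; any option works the same way.

# In airflow.cfg the option lives under its section:
#   [core]
#   parallelism = 32
# The equivalent environment variable, which takes precedence over the file:
$ export AIRFLOW__CORE__PARALLELISM="32"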

I have had good experiences setting up Airflow with the following approach:

  • Keep the airflow.cfg as untouched as possible.
  • Create a BASH script which defines the default configuration options as environment variables.
  • Create another “local” BASH script which overrides the variables from the first one (see the toy demonstration right after this list).
  • Source both files whenever interacting with Airflow.
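Why the sourcing order matters: whichever file is sourced last wins. A toy demonstration (defaults.sh and local.sh are made-up stand-ins, not the real script names):

$ cat defaults.sh
export GREETING="defined in defaults.sh"
$ cat local.sh
export GREETING="overridden in local.sh"
$ source defaults.sh && source local.sh
$ echo "$GREETING"
overridden in local.sh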

It may seem paradoxical not to change the configuration file. After all, that’s what it is for. There’s a good reason for it, though: documentation. To find out which changes were applied to the config file, you’d need to compare it to a vanilla copy or browse through the commit history. If, instead, all changes live in one other place, it is immediately obvious what the current Airflow setup requires.

Let’s get more concrete. Consider the following setup:

$ ls -a /path/to/repo
.gitignore
airflow.cfg
setup_airflow_env.local.sh
setup_airflow_env.sh

The airflow.cfg comes from the default installation and stays unchanged. The setup_airflow_env.sh exports the environment variables that override the file-based configuration. As an example, it could look like this:

$ cat setup_airflow_env.sh
export AIRFLOW__CORE__AIRFLOW_HOME="/path/to/our/airflow/stuff"
export AIRFLOW_HOME="${AIRFLOW__CORE__AIRFLOW_HOME}"
export AIRFLOW__CORE__DAGS_FOLDER="${AIRFLOW__CORE__AIRFLOW_HOME}/dags"
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="mysql://airflow:airflow@localhost:3306/airflow"
export AIRFLOW__CORE__EXECUTOR="CeleryExecutor"
export AIRFLOW__CELERY__CELERY_RESULT_BACKEND="db+mysql://airflow:airflow@localhost:3306/airflow"
export AIRFLOW__CORE__LOAD_EXAMPLES="False"
export AIRFLOW_CONN_AIRFLOW_DB="mysql://airflow:airflow@localhost:3306/airflow"

This way, we have already made it very clear which Airflow settings we had to change or intend to change. For example, the [core] section of airflow.cfg contains the dags_folder option. If the environment variable AIRFLOW__CORE__DAGS_FOLDER is set, Airflow uses its value instead.

However, this file ends up in the repository, so it must not contain any sensitive information. The setup_airflow_env.local.sh, on the other hand, is site-specific and redefines the environment variables where needed:

$ cat setup_airflow_env.local.sh
export AIRFLOW__CORE__AIRFLOW_HOME="/other/path/to/our/airflow/stuff"
export AIRFLOW_HOME="${AIRFLOW__CORE__AIRFLOW_HOME}"
export AIRFLOW__CORE__DAGS_FOLDER="${AIRFLOW__CORE__AIRFLOW_HOME}/dags"
...
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="mysql://airflow:realpassword@realhostname:3306/airflow"
...
export AIRFLOW__CELERY__CELERY_RESULT_BACKEND="db+mysql://airflow:realpassword@realhostname:3306/airflow"
...
export AIRFLOW_CONN_AIRFLOW_DB="mysql://airflow:realpassword@realhostname:3306/airflow"
...
$ cat .gitignore
# never share this file
setup_airflow_env.local.sh

We have now prepared the environment in a way that is easy to understand and pretty flexible. Before running any Airflow command, one simply sources both BASH scripts, in that order (the local one last). Airflow then derives the final configuration from airflow.cfg (lowest precedence), setup_airflow_env.sh, and setup_airflow_env.local.sh (highest precedence). A typical Airflow session might be something along these lines:

$ alias srcairflow='source /path/to/setup_airflow_env.sh && source /path/to/setup_airflow_env.local.sh'
$ srcairflow
$ airflow webserver &
$ airflow scheduler &
$ airflow worker

As a developer, you put the alias in your .bashrc and you’re ready to go. To launch the webserver in my development environment, I do:

# Press CTRL+C to reread the Airflow config and restart the web interface.
$ while true; do srcairflow ; airflow webserver ; done

Deployment is also simplified: The startup script simply needs to source both files before starting the scheduler or any worker. As an example, a simple start_worker.sh would look like this:

#!/bin/bash
set -e # Any subsequent commands which fail will cause the shell script to exit immediately
THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
source "$THIS_DIR/setup_airflow_env.sh"
source "$THIS_DIR/setup_airflow_env.local.sh"
airflow worker --daemon "$@"
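A matching start_scheduler.sh (hypothetical, just following the same pattern; not part of the repository layout above) would only differ in its last line:

#!/bin/bash
set -e
THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
source "$THIS_DIR/setup_airflow_env.sh"
source "$THIS_DIR/setup_airflow_env.local.sh"
# Pass any extra CLI flags straight through to the scheduler.
airflow scheduler "$@"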

The local environment setup script can be prepared by a Jenkins job. That’ll allow you to painlessly deploy the same code to any number of environments.
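A minimal sketch of what such a Jenkins step could write out, assuming the job injects the credentials as the hypothetical variables DB_USER, DB_PASSWORD and DB_HOST:

# Executed by the CI job on the target machine; the three variables below are
# placeholders provided by Jenkins credentials, not part of the original setup.
cat > /path/to/repo/setup_airflow_env.local.sh <<EOF
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/airflow"
export AIRFLOW__CELERY__CELERY_RESULT_BACKEND="db+mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/airflow"
export AIRFLOW_CONN_AIRFLOW_DB="mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/airflow"
EOF
# Keep the secrets readable by the Airflow user only.
chmod 600 /path/to/repo/setup_airflow_env.local.sh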

By the way, set -e forces BASH to abort on the first failing command. It also lets you enforce the creation of setup_airflow_env.local.sh: the script aborts the moment it tries to source the non-existent file.
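If you prefer a clearer error message over the one source prints for a missing file, a guard like the following (a variation on the same idea, not part of the scripts above) does the job:

# Optional: fail with an explicit hint instead of relying on set -e alone.
if [[ ! -f "$THIS_DIR/setup_airflow_env.local.sh" ]]; then
    echo "setup_airflow_env.local.sh is missing - create it before starting Airflow." >&2
    exit 1
fi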