An Effective Airflow Setup
Airflow is “a platform to programmatically author, schedule and monitor workflows”. It’s the new kid on the block when it comes to formalizing workflows, a.k.a. pipelines. This post is for you if you’re trying to get a decent Airflow environment set up.
Airflow comes with a whole bunch of configuration options. The basic configuration is defined in the `airflow.cfg`, an INI-style text file. However, any option can be overridden with an appropriately named environment variable. This is particularly handy for sensitive information (think passwords) and things that tend to change a lot (think hostnames).
I have had good experiences setting up Airflow with the following approach:
- Keep the `airflow.cfg` as untouched as possible.
- Create a BASH script which defines the default configuration options as environment variables.
- Create another “local” BASH script which overrides the first one.
- Source both files whenever interacting with Airflow.
It may seem paradoxical not to change the configuration file. After all, that’s what it is for. There’s a good reason for it, though: documentation. To find out which changes were applied to the config file, you’d need to compare it to a vanilla one or browse the commit history. If, on the other hand, all changes live in one dedicated place, it is immediately obvious what the current Airflow setup requires.
Let’s get more concrete. Consider the following setup:
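A plausible directory layout for this approach (the file names follow the conventions used below; the exact paths are up to you):

```
airflow/
├── airflow.cfg                  # vanilla config from the installation
├── setup_airflow_env.sh         # default settings, lives in the repository
└── setup_airflow_env.local.sh   # site-specific overrides, kept out of the repository
```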
The `airflow.cfg` comes from the default installation and stays unchanged.
The `setup_airflow_env.sh` exports the environment variables that override the file-based configuration. As an example, it could look like this:
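A minimal sketch, assuming a Celery-based setup; the concrete options and values are illustrative, not prescriptive:

```bash
#!/usr/bin/env bash
# Default configuration, safe to commit: no secrets in here.

export AIRFLOW_HOME="${AIRFLOW_HOME:-$HOME/airflow}"
export AIRFLOW__CORE__DAGS_FOLDER="$AIRFLOW_HOME/dags"
export AIRFLOW__CORE__EXECUTOR="CeleryExecutor"
export AIRFLOW__CORE__LOAD_EXAMPLES="False"
```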
This way, it is immediately clear which parts of the Airflow configuration we changed or intend to change.
For example, the `airflow.cfg`’s `[core]` section contains the `dags_folder` setting. If the environment variable `AIRFLOW__CORE__DAGS_FOLDER` is set, Airflow will use its value instead.
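The general pattern is `AIRFLOW__<SECTION>__<KEY>`: section and key, upper-cased and joined by double underscores. For instance:

```bash
# Overrides "dags_folder" in the [core] section of airflow.cfg:
export AIRFLOW__CORE__DAGS_FOLDER="/opt/airflow/dags"
```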
However, this file ends up in the repository, so it should not contain any sensitive information.
The `setup_airflow_env.local.sh`, on the other hand, is site-specific and redefines the environment variables where needed:
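Again a sketch; the hostnames and credentials are placeholders:

```bash
#!/usr/bin/env bash
# Site-specific overrides; this file stays out of the repository.

export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:secret@db.example.com/airflow"
export AIRFLOW__CELERY__BROKER_URL="redis://redis.example.com:6379/0"
```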
We have now prepared the environment in a way that is easy to understand and also pretty flexible. Before running any Airflow command, one simply sources both BASH scripts, in the correct order.
The final configuration is then derived by Airflow from `airflow.cfg` (lowest precedence), `setup_airflow_env.sh`, and `setup_airflow_env.local.sh` (highest precedence).
A typical Airflow session might be something along these lines:
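A sketch of such a session; the alias name `airflow-env` is made up, and `airflow list_dags` is the Airflow 1.x CLI:

```bash
# Source both scripts, in the right order, via a convenience alias.
alias airflow-env='source "$AIRFLOW_HOME/setup_airflow_env.sh" && source "$AIRFLOW_HOME/setup_airflow_env.local.sh"'

airflow-env        # prepare the environment
airflow list_dags  # then use Airflow as usual
```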
As a developer, you’ll put the alias in your `.bashrc` and you’re ready to go.
To launch the webserver on my development machine, I do:
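Using the hypothetical alias from above:

```bash
airflow-env && airflow webserver
```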
Deployment is also simplified: The startup script simply needs to source both files before starting the scheduler or any worker.
As an example, a simple `start_worker.sh` would look like this:
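A sketch, assuming the scripts are deployed to `/opt/airflow`; `airflow worker` is the Airflow 1.x command for starting a Celery worker:

```bash
#!/usr/bin/env bash
set -e  # abort on the first failing command

# Load default and site-specific configuration, in that order.
source /opt/airflow/setup_airflow_env.sh
source /opt/airflow/setup_airflow_env.local.sh

exec airflow worker
```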
The local environment setup script can be prepared by a Jenkins job. That’ll allow you to painlessly deploy the same code to any number of environments.
By the way, setting `set -e` in any script forces BASH to abort on the first failing command. This enables you to enforce the creation of `setup_airflow_env.local.sh`: the script will abort the moment it tries to source the non-existent file.
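A minimal illustration of the effect:

```bash
set -e
source setup_airflow_env.local.sh  # exits here if the file does not exist
echo "only reached when the local config is in place"
```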