BioBB Apache Airflow implementation

This repo hosts a Docker Compose implementation of Apache Airflow for running BioExcel Building Blocks (BioBB) workflows.

Prepare configuration files

docker-compose.yaml

The docker-compose.yaml file specifies which images are required, which ports they expose, whether they have access to the host filesystem, which commands run at startup, and so on.

.env file

⚠️ No sensible default value is provided for any of these fields; all of them must be defined ⚠️

A .env file must be created in the project folder; the file .env.git can be taken as an example. It must contain the following environment variables:

key                 value    description
AIRFLOW_IMAGE_NAME  string   Image version used for Apache Airflow
AIRFLOW_UID         number   Airflow user identifier
AIRFLOW_PROJ_DIR    string   Absolute path to this Apache Airflow implementation
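A minimal .env might look like the following; the values below are placeholders for illustration, not recommended defaults:

```shell
# .env : placeholder values, adjust to your environment
AIRFLOW_IMAGE_NAME=apache/airflow:2.10.2
AIRFLOW_UID=50000
AIRFLOW_PROJ_DIR=/home/user/biobb-airflow
```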

Utils

airflow_cwl_utils.py

The airflow_cwl_utils.py file is a utility module shared across all DAGs in the Apache Airflow setup. It acts as the bridge between Airflow and CWL (Common Workflow Language). It has two responsibilities:

  • resolve_inputs(): Reads a step's input YAML file and resolves references before execution.
  • create_bash_command(): Builds the shell command string that Airflow's BashOperator will execute for each step.
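The exact command string is defined in airflow_cwl_utils.py; purely as an illustration, the shape of what create_bash_command() assembles for a step is roughly the following (the function name below is a sketch and the paths and arguments are hypothetical):

```shell
# Hypothetical sketch of the command string a BashOperator task runs:
# it invokes the cwl_run.sh plugin with the workflow name, step name,
# and the step's resolved inputs YAML.
build_step_command() {
  local workflow="$1" step="$2" inputs_yml="$3"
  echo "bash /opt/airflow/plugins/cwl_run.sh ${workflow} ${step} ${inputs_yml}"
}

build_step_command my_wf pdb_fetch inputs/pdb_fetch.yml
```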

Plugins

cwl_run.sh

cwl_run.sh is the shell script that actually executes a single CWL workflow step. It is called by every BashOperator task via create_bash_command() in airflow_cwl_utils.py.
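The real script lives in the repo's plugins folder; a simplified, dry-run sketch of what such a step runner boils down to (file names here are hypothetical, and this variant only prints the cwltool invocation instead of executing it) might be:

```shell
# Hypothetical, simplified sketch of a CWL step runner.
# A real runner would execute the command; this dry-run variant
# only prints the cwltool invocation it would perform.
run_cwl_step() {
  local cwl_file="$1" job_yml="$2" outdir="$3"
  echo "cwltool --outdir ${outdir} ${cwl_file} ${job_yml}"
}

run_cwl_step steps/pdb_fetch.cwl inputs/pdb_fetch.yml outputs/pdb_fetch
```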

docker_wrapper.sh

Why docker_wrapper.sh? Because cwltool normally calls docker run directly, but the Airflow worker is itself a container. The wrapper remaps paths from the container's /opt/airflow/... namespace to the host's real paths, so Docker-in-Docker mounts work correctly.
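The core idea, remapping a container path prefix onto the host's real project directory, can be sketched with plain bash string substitution; the host path below is a hypothetical example, not the wrapper's actual code:

```shell
# Remap a path from the worker container's namespace (/opt/airflow/...)
# to the host's real project directory, so that Docker-in-Docker bind
# mounts resolve correctly. AIRFLOW_PROJ_DIR comes from the .env file.
AIRFLOW_PROJ_DIR=/home/user/biobb-airflow   # hypothetical host path
container_path=/opt/airflow/data/outputs/step1/result.pdb
host_path="${container_path/#\/opt\/airflow/$AIRFLOW_PROJ_DIR}"
echo "$host_path"
```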

CWL Airflow

The x-airflow-common block is a YAML anchor: a reusable configuration block that avoids repeating the same settings across every Airflow service. In this implementation it has been customized to install cwltool, which is needed to execute the BioExcel Building Blocks on Apache Airflow.
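As a reminder of the mechanism, a YAML anchor (&) defines a block once and merge aliases (<<: *) reuse it in each service. A minimal illustration of the pattern (not the repo's actual block):

```yaml
x-airflow-common: &airflow-common
  build: .                       # custom Dockerfile that installs cwltool
  environment:
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor

services:
  airflow-scheduler:
    <<: *airflow-common          # reuse the shared block
    command: scheduler
```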

The CWL Airflow Dockerfile installs cwltool in each of the Apache Airflow Services automatically.

Build services

First, go to the project root folder.

To build the services with Docker Compose, run:

docker compose build

Deploy the services:

docker compose up -d

List the containers:

$ docker ps -a
CONTAINER ID   IMAGE                          COMMAND                  CREATED        STATUS                     PORTS                    NAMES
<ID>           docker-airflow-worker          "/usr/bin/dumb-init …"   16 hours ago   Up 16 hours (healthy)      8080/tcp                 <NAME>
<ID>           docker-airflow-apiserver       "/usr/bin/dumb-init …"   16 hours ago   Up 16 hours (healthy)      0.0.0.0:8080->8080/tcp   <NAME>
<ID>           docker-airflow-triggerer       "/usr/bin/dumb-init …"   16 hours ago   Up 16 hours (healthy)      8080/tcp                 <NAME>
<ID>           docker-airflow-scheduler       "/usr/bin/dumb-init …"   16 hours ago   Up 16 hours (healthy)      8080/tcp                 <NAME>
<ID>           docker-airflow-dag-processor   "/usr/bin/dumb-init …"   16 hours ago   Up 16 hours (healthy)      8080/tcp                 <NAME>
<ID>           docker-airflow-init            "/bin/bash -c 'if [[…"   16 hours ago   Exited (0) 16 hours ago                             <NAME>
<ID>           postgres:16                    "docker-entrypoint.s…"   16 hours ago   Up 16 hours (healthy)      5432/tcp                 <NAME>
<ID>           nginx:alpine                   "/docker-entrypoint.…"   16 hours ago   Up 16 hours                0.0.0.0:8888->80/tcp     <NAME>
<ID>           redis:7.2-bookworm             "docker-entrypoint.s…"   16 hours ago   Up 16 hours (healthy)      6379/tcp                 <NAME>

Execute services

Apache Airflow

Open a browser and type:

http://localhost:8080/

Outputs server

Apache Airflow doesn't serve output files because it was designed as a workflow orchestrator, not a data platform. Its core philosophy is:

"Airflow schedules and monitors tasks. What those tasks do with data is not Airflow's concern."

So, in this implementation, an Nginx server for accessing the outputs via web is provided.

Once a workflow has run, open a browser and type:

http://localhost:8888/<WF NAME>/outputs/<STEP>/<FILE NAME>

Shutdown

Shutdown all the Airflow services:

docker compose down

Tips

Clear a DAG from the database (sometimes stale DAGs are cached and must be removed, e.g. after changing the DAG name):

docker compose exec airflow-apiserver airflow dags delete <dag_name_to_remove> -y

Copyright & Licensing

This software has been developed in the MMB group at the IRB for the European BioExcel, funded by the European Commission (EU Horizon Europe 101093290, EU H2020 823830, EU H2020 675728).

Licensed under the Apache License 2.0, see the file LICENSE for details.
