At a high level, backfilling in Airflow is a mechanism that allows (re-)execution of Airflow DAG instances for a specific time interval. It is useful, but it is also a brittle process. Common use cases for backfilling data are:
- the data pipeline code changed (a bugfix for example) and the output data should be refreshed using the new pipeline code
- an external service (an SFTP for example) was offline for a period of time and a number of DAG instances did not succeed
- in a development environment the pipeline code should be tested to verify that it processes input data correctly and produces the expected output data
A common gotcha with backfill in Airflow is that the scheduler does not cope well with backfill jobs that trigger dozens to hundreds of DAG instances (tested in pre-2.0 Airflow versions). It is unclear to me exactly why this happens, but I found that if a large backfill job is split up into multiple smaller ones, Airflow handles them just fine.
I thus wrote a simple tool: abf, a CLI tool that issues backfill commands to Airflow in small portions. One specifies the date range to backfill over, for example
-start_dt 2021-02-04 -end_dt 2021-03-05
and the portion size used to split that timespan, for example
--interval 6h
In this example the tool issues backfill commands for 6-hour portions, one after another, until the whole timespan is covered.
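The splitting idea can be sketched roughly as follows. This is not abf's actual code, just a minimal illustration under a few assumptions: the helper names (`parse_interval`, `split_range`, `backfill_commands`) and the accepted interval units are made up for this example, and the commands use the pre-2.0 `airflow backfill -s ... -e ... <dag_id>` form (Airflow 2.x uses `airflow dags backfill` instead). Adjacent portions share a boundary timestamp here; depending on how Airflow treats the end date, you may want to shift it to avoid running the boundary instance twice.

```python
from datetime import datetime, timedelta


def parse_interval(text):
    """Parse a hypothetical '<number><unit>' interval such as '6h' or '1d'."""
    units = {"m": "minutes", "h": "hours", "d": "days"}
    value, unit = int(text[:-1]), text[-1]
    if unit not in units or value <= 0:
        raise ValueError(f"unsupported interval: {text!r}")
    return timedelta(**{units[unit]: value})


def split_range(start, end, step):
    """Yield consecutive (portion_start, portion_end) pairs covering start..end."""
    current = start
    while current < end:
        portion_end = min(current + step, end)  # last portion may be shorter
        yield current, portion_end
        current = portion_end


def backfill_commands(dag_id, start, end, step):
    """Build one pre-2.0 style `airflow backfill` command per portion."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    return [
        ["airflow", "backfill", "-s", s.strftime(fmt), "-e", e.strftime(fmt), dag_id]
        for s, e in split_range(start, end, step)
    ]


if __name__ == "__main__":
    # "my_dag" is a placeholder DAG id; the commands are only printed here.
    commands = backfill_commands(
        "my_dag", datetime(2021, 2, 4), datetime(2021, 3, 5), parse_interval("6h")
    )
    for command in commands[:3]:
        print(" ".join(command))
```

To actually run the portions one after another, each command list could be passed to `subprocess.run(command, check=True)`, which waits for one backfill to finish (and stops on failure) before the next one starts.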
Lastly, while slightly overkill for a tool like this, I took some time to set up an automated lint and test step based on GitHub Actions, using the Python tool fourmat, which combines Flake8 with Black and isort.
There is currently no PyPI package for it, so to install it you have to clone the repo. I might go through that exercise next.