How to prevent Airflow from backfilling DAG runs?
Say you have an Airflow DAG that doesn't make sense to backfill, meaning that, after it has run once, running it again shortly afterwards would be completely pointless.
For example, if you're loading data into your database from a source that is only updated hourly, backfilling, which runs in rapid succession, would just import the same data again and again.
This is especially annoying when you instantiate a new hourly task: it runs N times, once for each hour it missed, doing redundant work, before it starts running on the interval you specified.
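The redundant work scales directly with how stale the start date is. A back-of-envelope sketch in plain Python (the dates are made up for illustration):

```python
from datetime import datetime, timedelta

# Illustration only (hypothetical dates): with catchup enabled, an hourly
# DAG whose start_date is 3 days in the past gets one run per missed
# interval before it settles onto the schedule you actually wanted.
start_date = datetime(2023, 1, 1, 0, 0)
now = datetime(2023, 1, 4, 0, 0)
schedule_interval = timedelta(hours=1)

missed_runs = int((now - start_date) / schedule_interval)
print(missed_runs)  # 72 redundant hourly runs
```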
The only solution I can think of is something they specifically advise against in the FAQ of the docs:

    We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing.

Is there any way to disable backfilling for a DAG, or should I do the above?
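For reference, here is a minimal plain-Python sketch of why the pattern the FAQ warns about is confusing (no Airflow needed): every time the scheduler re-parses the DAG file, datetime.now() yields a new, later value, so the "start" keeps sliding forward.

```python
from datetime import datetime
import time

# Discouraged (per the Airflow FAQ): a dynamic start_date is re-evaluated
# on every DAG-file parse, so the start keeps moving.
start_on_first_parse = datetime.now()
time.sleep(0.01)  # pretend the scheduler re-parses the file a bit later
start_on_second_parse = datetime.now()

# The two parses disagree about when the DAG "starts".
print(start_on_second_parse > start_on_first_parse)  # True
```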
Upgrade to Airflow 1.8 and set catchup_by_default = False in airflow.cfg, or pass catchup=False to each of your DAGs.
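A minimal sketch of the global option (assuming Airflow ≥ 1.8; the section placement in airflow.cfg may vary by version):

```ini
[scheduler]
catchup_by_default = False
```

The per-DAG catchup argument overrides this default, so individual DAGs can still opt back in to catchup if needed.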
This appears to be an unsolved Airflow problem. I know I would really like to have exactly the same feature. Here is as far as I've gotten; it may be useful to others.
There are UI features which can help with this problem. If you go to the Tree view and click on a specific task (the square boxes), a dialog will come up with a 'Mark Success' button. Clicking 'Past', then 'Mark Success', will label all the instances of that task in the DAG as successful and they will not be run. The top-level DAG (the circles at the top) can also be labeled as successful in a similar fashion, but there doesn't appear to be a way to label multiple DAG instances.
I haven't looked into it deeply enough yet, but it may be possible to use the trigger_dag subcommand to mark the states of DAGs. See here: https://github.com/apache/incubator-airflow/pull/644/commits/4d30d4d79f1a18b071b585500474248e5f46d67d
A CLI feature to mark DAGs is in the works: http://mail-archives.apache.org/mod_mbox/airflow-commits/201606.mbox/%3CJIRA.12973462.1464369259000.37918.1465189859133@Atlassian.JIRA%3E https://github.com/apache/incubator-airflow/pull/1590
UPDATE (9/28/2016): A new operator, LatestOnlyOperator (https://github.com/apache/incubator-airflow/pull/1752), has been added; it will only run the latest version of downstream tasks. It sounds very useful, and hopefully it will make it into a release soon.
UPDATE 2: As of Airflow 1.8, the LatestOnlyOperator has been released.
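A hedged sketch of what this buys you. The simplified check below mimics the core of what LatestOnlyOperator does (it short-circuits downstream tasks on any run that isn't the most recent interval); the commented lines show the typical Airflow wiring, with the caveat that the import path has moved between Airflow versions.

```python
from datetime import datetime

# Typical Airflow wiring (commented out so this sketch stays
# self-contained; the import path varies by Airflow version):
#   from airflow.operators.latest_only import LatestOnlyOperator
#   latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
#   latest_only >> downstream_task  # skipped on every run but the latest

def is_latest_run(interval_start: datetime, interval_end: datetime,
                  now: datetime) -> bool:
    """Simplified version of the operator's check: only the run whose
    schedule interval contains 'now' is the latest and may proceed."""
    return interval_start <= now < interval_end

# A backfill run from last year is skipped; the current interval proceeds.
now = datetime(2023, 6, 1, 12, 30)
print(is_latest_run(datetime(2022, 1, 1), datetime(2022, 1, 2), now))  # False
print(is_latest_run(datetime(2023, 6, 1), datetime(2023, 6, 2), now))  # True
```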
Setting catchup=False in your DAG declaration will provide this exact functionality.
I don't have the reputation to comment, but I wanted to say that catchup=False was designed (by me) for this exact purpose. I can also verify that in 1.10.1 it works when set explicitly in the DAG instantiation; however, I do not see it working when placed in default_args. I've been away from Airflow for 18 months, though, so it will be a bit before I can take a look at why default_args isn't working for catchup.
    dag = DAG(
        'example_dag',
        max_active_runs=3,
        catchup=False,
        schedule_interval=timedelta(minutes=5),
        default_args=default_args,
    )
If it's critical that you are alerted when DAG runs take longer than 4 hours in the ordinary (non-backfill) scenario, then I'd add 4-hour SLAs on all the tasks. When you clear tasks for a backfill, it will immediately trigger the SLA misses, but at least they should all happen at once, in bulk, and won't fail your runs.
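The suggestion above can be sketched as follows. The sla parameter is a standard Airflow task argument, but the task name here is hypothetical, and the plain-Python check at the bottom only illustrates what an SLA miss means (the Airflow form is shown in comments):

```python
from datetime import timedelta

# Airflow form (commented; requires an Airflow install and a `dag` object):
#   task = PythonOperator(task_id="hourly_load", sla=timedelta(hours=4),
#                         python_callable=load, dag=dag)

# Roughly what the scheduler checks: a task instance that takes longer
# than its SLA is recorded as an SLA miss (an alert fires), but the run
# itself is not failed.
sla = timedelta(hours=4)
actual_runtime = timedelta(hours=5)  # hypothetical slow run
sla_missed = actual_runtime > sla
print(sla_missed)  # True
```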