airflow-dev mailing list archives

From Jeremiah Lowin <>
Subject Re: Flow-based Airflow?
Date Wed, 01 Feb 2017 20:42:00 GMT
Yesterday I began converting a complex script to a DAG. It turned out to be
a perfect test case for the dataflow model: a big chunk of data moving
through a series of modification steps.

So I have built an extensible dataflow layer for Airflow on top of XCom
and the existing dependency engine: (still waiting for
tests... it will be quite embarrassing if they don't pass)

The philosophy is simple:
Dataflow objects represent the output of upstream tasks. Downstream tasks
add Dataflows with a specific key. When the downstream task runs, the
(optionally indexed) upstream result is available in the downstream context
under context['dataflows'][key]. In addition, PythonOperators receive the
data as a keyword argument.
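The pattern described above can be sketched in plain Python (all names here
are hypothetical and invented for illustration -- this is not the actual
Airflow API, just a minimal model of the key/context/kwarg flow):

```python
# Minimal sketch of the dataflow pattern: a Dataflow wraps an upstream
# task's output, and a downstream callable receives it both via
# context['dataflows'][key] and as a keyword argument.

class Dataflow:
    """Represents the (optionally indexed) output of an upstream task."""

    def __init__(self, upstream_result, index=None):
        self.upstream_result = upstream_result
        self.index = index

    def resolve(self):
        # Return the upstream result, applying the optional index.
        if self.index is None:
            return self.upstream_result
        return self.upstream_result[self.index]


def run_downstream(python_callable, dataflows):
    """Build the downstream context and pass resolved data as kwargs,
    mirroring how a PythonOperator would receive them."""
    resolved = {key: flow.resolve() for key, flow in dataflows.items()}
    context = {"dataflows": resolved}
    return python_callable(context=context, **resolved)


# Example: an upstream task produced a dict; the downstream task attaches
# one of its entries under the key "numbers" and sums it.
upstream_output = {"a": [1, 2, 3], "b": [4, 5]}
flows = {"numbers": Dataflow(upstream_output, index="a")}

result = run_downstream(lambda context, numbers: sum(numbers), flows)
print(result)  # -> 6
```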

The basic Dataflow serializes the data through XComs, but is trivially
extended to alternative storage via subclasses. I have provided (in
contrib) implementations of a local filesystem-based Dataflow as well as a
Google Cloud Storage dataflow.
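The storage-extension idea can be sketched as follows (class and method
names are invented for illustration; this is not the contrib code, only a
model of keeping the payload elsewhere while XCom holds a pointer):

```python
import json
import os
import tempfile

class BaseDataflow:
    """Serializes data through an XCom-like key/value store."""

    _xcom = {}  # stand-in for Airflow's XCom table

    def push(self, key, data):
        self._xcom[key] = json.dumps(data)

    def pull(self, key):
        return json.loads(self._xcom[key])


class LocalFileDataflow(BaseDataflow):
    """Stores the payload on the local filesystem; the XCom store keeps
    only the file path, not the data itself."""

    def __init__(self, directory):
        self.directory = directory

    def push(self, key, data):
        path = os.path.join(self.directory, f"{key}.json")
        with open(path, "w") as f:
            json.dump(data, f)
        super().push(key, path)  # XCom holds a pointer, not the payload

    def pull(self, key):
        path = super().pull(key)  # recover the file path from XCom
        with open(path) as f:
            return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    flow = LocalFileDataflow(tmp)
    flow.push("result", {"rows": 42})
    roundtrip = flow.pull("result")
print(roundtrip)  # -> {'rows': 42}
```

A cloud-storage variant would follow the same shape, swapping the file
read/write for bucket operations.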

Laura, I hope you can have a look and see if this will bring some of your
requirements into Airflow as first-class citizens.

