airflow-dev mailing list archives

From Jeremiah Lowin <jlo...@apache.org>
Subject Re: Flow-based Airflow?
Date Wed, 01 Feb 2017 20:42:00 GMT
Yesterday I began converting a complex script to a DAG. It turned out to be
a perfect test case for the dataflow model: a big chunk of data moving
through a series of modification steps.

So I have built an extensible dataflow extension for Airflow on top of XCom
and the existing dependency engine:
https://issues.apache.org/jira/browse/AIRFLOW-825
https://github.com/apache/incubator-airflow/pull/2046 (still waiting for
tests... it will be quite embarrassing if they don't pass)

The philosophy is simple:
Dataflow objects represent the output of upstream tasks. Downstream tasks
add Dataflows under a specific key. When the downstream task runs, the
(optionally indexed) upstream result is available in the downstream context
as context['dataflows'][key]. In addition, PythonOperators receive the
data as a keyword argument.
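
For illustration, here is a minimal sketch of what this looks like in a
DAG. It follows the description above; the Dataflow import path, the
add_dataflow method name, and the index argument are my shorthand here,
not necessarily the final API in the PR:

from datetime import datetime

from airflow import DAG
from airflow.models import Dataflow  # assumed import path
from airflow.operators.python_operator import PythonOperator

dag = DAG('dataflow_example', start_date=datetime(2017, 2, 1))

def produce(**context):
    # the return value becomes the Dataflow payload (carried by XCom)
    return list(range(10))

def consume(numbers, **context):
    # the upstream result arrives both as a keyword argument and
    # under context['dataflows']['numbers']
    print(sum(numbers))

producer = PythonOperator(task_id='produce', python_callable=produce,
                          provide_context=True, dag=dag)
consumer = PythonOperator(task_id='consume', python_callable=consume,
                          provide_context=True, dag=dag)

# register the upstream output under the key 'numbers'; this also sets
# the task dependency. An index could select part of the result, e.g.
# Dataflow(producer, index='some_key').
consumer.add_dataflow('numbers', Dataflow(producer))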

The basic Dataflow serializes the data through XComs, but is trivially
extended to alternative storage via subclasses. I have provided (in
contrib) implementations of a local filesystem-based Dataflow as well as a
Google Cloud Storage Dataflow.
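
Continuing the sketch above, a subclass would override the points where
the payload is written and read, so that only a pointer travels through
XCom (serialize/deserialize are illustrative names, not necessarily what
the PR exposes):

import json
import os

class LocalFileDataflow(Dataflow):
    # keeps the payload on the local filesystem; only the file path
    # travels through XCom

    def __init__(self, task, directory='/tmp/dataflows', **kwargs):
        super(LocalFileDataflow, self).__init__(task, **kwargs)
        self.directory = directory

    def serialize(self, context, data):
        # upstream side: write the data out, return a pointer for XCom
        path = os.path.join(
            self.directory, context['task_instance_key_str'] + '.json')
        with open(path, 'w') as f:
            json.dump(data, f)
        return path

    def deserialize(self, context, path):
        # downstream side: follow the pointer and load the payload
        with open(path) as f:
            return json.load(f)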

Laura, I hope you can have a look and see if this will bring some of your
requirements into Airflow as first-class citizens.

Jeremiah
