airflow-dev mailing list archives

From Jeremiah Lowin <jlo...@apache.org>
Subject Re: Flow-based Airflow?
Date Wed, 01 Feb 2017 23:14:24 GMT
Great point. I think the best solution is to solve this for all XComs by
checking an object's size before adding it to the DB. I don't see a
built-in way of handling it (though apparently MySQL is internally limited
to 64 KB). I'll look into a PR that would enforce a similar limit for all
databases.
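
A minimal sketch of what that check might look like (the function name,
the exact limit, and the exception type are assumptions, not a final
design; pickle is used here since that's how XCom values are serialized
today):

import pickle

# 64 KB, in line with MySQL's effective limit mentioned above.
MAX_XCOM_SIZE = 64 * 1024

def serialize_xcom_value(value):
    """Serialize an XCom value, refusing anything too large for the DB."""
    blob = pickle.dumps(value)
    if len(blob) > MAX_XCOM_SIZE:
        raise ValueError(
            'XCom value is %d bytes, above the %d-byte limit; store the '
            'data externally and pass a reference instead.'
            % (len(blob), MAX_XCOM_SIZE))
    return blob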

On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:

I'm not sure about XCom being the default; it seems pretty dangerous. It
just takes one person who is not fully aware of the size of the data, or
one day with an outlier, and that could put the Airflow DB in jeopardy.

I guess it's always been an aspect of XCom, and it could be good to have
some explicit gatekeeping there regardless of this PR/feature. Perhaps the
DB itself has protection against large blobs?

Max

On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin <jlowin@apache.org> wrote:

> Yesterday I began converting a complex script to a DAG. It turned out to
> be a perfect test case for the dataflow model: a big chunk of data moving
> through a series of modification steps.
>
> So I have built an extensible dataflow layer for Airflow on top of XCom
> and the existing dependency engine:
> https://issues.apache.org/jira/browse/AIRFLOW-825
> https://github.com/apache/incubator-airflow/pull/2046 (still waiting for
> tests... it will be quite embarrassing if they don't pass)
>
> The philosophy is simple:
> Dataflow objects represent the output of upstream tasks. Downstream tasks
> add Dataflows with a specific key. When the downstream task runs, the
> (optionally indexed) upstream result is available in the downstream
> context under context['dataflows'][key]. In addition, PythonOperators
> receive the data as a keyword argument.
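>
> A sketch of what usage might look like (hypothetical: the add_dataflow
> call and the 'raw' key follow the description above, not necessarily the
> final API in the PR):
>
> from datetime import datetime
> from airflow import DAG
> from airflow.operators.python_operator import PythonOperator
>
> dag = DAG('dataflow_demo', start_date=datetime(2017, 2, 1))
>
> def extract(**context):
>     return list(range(100))  # upstream output, captured as a Dataflow
>
> def transform(raw=None, **context):
>     # delivered as a keyword argument and via context['dataflows']['raw']
>     return [x * 2 for x in raw]
>
> extract_task = PythonOperator(task_id='extract', python_callable=extract,
>                               provide_context=True, dag=dag)
> transform_task = PythonOperator(task_id='transform',
>                                 python_callable=transform,
>                                 provide_context=True, dag=dag)
>
> # Hypothetical call: registers the dependency and delivers the upstream
> # result to the downstream task under the key 'raw'.
> transform_task.add_dataflow(key='raw', upstream_task=extract_task)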
>
> The basic Dataflow serializes its data through XCom, but is trivially
> extended to alternative storage via subclassing. I have provided (in
> contrib) implementations of a local filesystem-based Dataflow as well as
> a Google Cloud Storage Dataflow.
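>
> For instance, a filesystem-backed subclass might look roughly like this
> (the Dataflow base class and the serialize/deserialize hook names are
> assumptions; see the contrib code in the PR for the real ones):
>
> import pickle
>
> class LocalFileDataflow(Dataflow):  # hypothetical base class name
>     def __init__(self, path, **kwargs):
>         super(LocalFileDataflow, self).__init__(**kwargs)
>         self.path = path
>
>     def serialize(self, value):
>         # write the payload to disk; only the path travels through XCom
>         with open(self.path, 'wb') as f:
>             pickle.dump(value, f)
>         return self.path
>
>     def deserialize(self, path):
>         with open(path, 'rb') as f:
>             return pickle.load(f)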
>
> Laura, I hope you can have a look and see if this will bring some of your
> requirements into Airflow as first-class citizens.
>
> Jeremiah
>
