airflow-dev mailing list archives

From Laura Lorenz <>
Subject Re: Contrib & Dataflow
Date Thu, 09 Feb 2017 18:12:37 GMT
Re: data storage and file reference metadata as a process of the
post_execute hook
I'm interested to hear more about this idea, as I can't visualize how (or
if) it would implement multi-backend IO and either a standard or drop-in
serialization of result objects.

I just commented on the PR
re: Max's comments, since I wasn't totally sure where that conversation
should be had, but I can move it over here if we want more visibility.

Re: breaking out repos
I know this has had some support for a while, from eavesdropping on this
list and reading committer meeting reports, but I want to throw out some of
the gotchas we experienced, first from deriving our own plugins (using the
Airflow plugin system <>) and then, when that became too unwieldy for us
because of the plugin module discovery system, from packaging some of our
custom operators and hooks separately (fileflow <>). In the latter case,
which is closer to what you are proposing, we had problems patching into the
core Airflow configuration management system.
This could have been just us (or may have been fixed since Airflow 1.7.0,
the version we are still running), but take it as a word of caution about
things to consider or redesign, given what we experienced packaging Airflow
add-ons separately.

On Sat, Feb 4, 2017 at 1:45 PM, Jeremiah Lowin <> wrote:

> Max made some great points on my dataflow PR and I wanted to continue the
> conversation here to make sure the conversation was visible to all.
> While I think my dataflow implementation contains the basic requirements
> for any more complicated extension (but that conversation can wait!), I had
> to implement it by adding some very specific "dataflow-only" code to core
> Operator logic. In retrospect, that makes me pause (as, I believe, it did
> for Max).
> After thinking for a few days, what I really want to do is propose a very
> small change to core Airflow: change BaseOperator.post_execute(context) to
> BaseOperator.post_execute(result, context). I think the pre_execute and
> post_execute hooks have generally been an afterthought, but with that
> change (which, I think, is reasonable in and of itself) I could implement
> entirely through those hooks.
> So that brings me to my next point: if the hook is changed, I could happily
> drop a reworked dataflow implementation into contrib, rather than core.
> That would alleviate some of the pressure for Airflow to officially decide
> whether it's the right implementation or not (it is! :) ). I feel like that
> would be the optimal situation at the moment.
> And that brings me to my next point: the future of "contrib" and the
> Airflow community.
> Having contrib in the core Airflow repo has some advantages:
>   - standardized access
>   - centralized repository for PRs
>   - at least a style review (if not unit tests) from the committers
> But some big disadvantages as well:
>   - Very complicated dependency management [presumably, most contrib
> operators need to add an extras_require entry for their specific
> dependencies]
>   - No sense of ownership or even an easy way to raise issues (due to
> friction of opening JIRA tickets vs github issues)
> One thought is to move the contrib directory to its own repo which would
> keep the advantages but remove the disadvantages from core Airflow. Another
> is to encourage individual airflow repos (Airflow-Docker, Airflow-Dataflow,
> Airflow-YourExtensionHere) which could be installed a la carte. That would
> leave maintenance up to the original author, but could lead to some
> fracturing in the community as discovery becomes difficult.
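
For readers trying to picture the post_execute(result, context) proposal in
the quoted message, here is a minimal sketch of how result storage could
hang off that hook. It does not import Airflow: SketchOperator is a stand-in
for BaseOperator, and run(), JsonFileResultMixin, storage_dir, and the
"result_path" context key are all hypothetical names invented for
illustration. It is not the dataflow PR's implementation.

```python
import json
import os
import tempfile


class SketchOperator:
    """Stand-in for an operator base class; NOT Airflow's BaseOperator."""

    def __init__(self, task_id):
        self.task_id = task_id

    def pre_execute(self, context):
        pass

    def execute(self, context):
        raise NotImplementedError

    def post_execute(self, result, context):
        # Proposed signature: the hook also receives execute()'s return value.
        pass

    def run(self, context):
        # Hypothetical driver mimicking how a task runner might call the hooks.
        self.pre_execute(context)
        result = self.execute(context)
        self.post_execute(result, context)
        return result


class JsonFileResultMixin:
    """Hypothetical mixin: serialize the result to a local JSON file."""

    storage_dir = tempfile.gettempdir()

    def post_execute(self, result, context):
        path = os.path.join(self.storage_dir, self.task_id + ".json")
        with open(path, "w") as handle:
            json.dump(result, handle)
        # Record the file reference so a downstream task could locate it.
        context["result_path"] = path


class CountWords(JsonFileResultMixin, SketchOperator):
    def execute(self, context):
        return {"words": len(context["text"].split())}


if __name__ == "__main__":
    ctx = {"text": "contrib and dataflow"}
    CountWords("count_words").run(ctx)
    print(ctx["result_path"])
```

Swapping the mixin for one that writes to a different backend or uses a
different serializer would leave execute() untouched, which is the property
the hook change is meant to enable. Whether that covers multi-backend IO in
practice is the open question raised at the top of this reply.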
