airflow-dev mailing list archives

From Ash Berlin-Taylor <...@apache.org>
Subject Re: [PROPOSAL][AIP-36 DAG Versioning]
Date Tue, 11 Aug 2020 18:33:06 GMT
Anything to do with the process of building wheels should be a "power user" only feature,
and should not be required for most users - many, many users of Airflow are not primarily
Python developers, but data scientists, and needing them to understand anything about the
Python build toolchain is too steep a learning curve for the benefit.

After all, it is very rare that people hit the problem of multiple concurrent versions of a DAG.

-ash

On 10 August 2020 17:37:32 BST, Tomasz Urbaszek <turbaszek@apache.org> wrote:
>I like the idea of wheels as this is probably the "most pythonic"
>solution. And "DAG version" is not only defined by DAG code but also
>by all dependencies the DAG uses (custom functions, libraries etc) and
>it seems that wheels can address that.
>
>However, I second Ash - keeping wheels in db doesn't sound good. In my
>opinion, DAG fetcher is the right solution and the idea surfaces every
>time we talk about serialization. This abstraction has a lot of pros
>as it allows a lot of customization (wheels, local fs, remote fs,
>etc.).
>
>Apart from that, if we decided to use wheels we should provide a CLI
>command to ease the process of building them. Also, I'm wondering
>about developers' workflow. Moving between the code of different DAG
>versions sounds easy if you use git but... what if someone doesn't use
>it?
>
>Tomek
>
>
>On Sat, Aug 8, 2020 at 9:49 AM Ash Berlin-Taylor <ash@apache.org> wrote:
>>
>> Quick comment (as I'm still mostly on paternity leave):
>>
>> Storing wheels in the db sounds like a bad idea to me, especially if we need to
>> store deps in there too (and if we don't store deps, then they are incomplete) -
>> they could get very large, and I've stored blobs of ~10mb in postgres before: I
>> don't recommend it. It "works" but operating it is tricky.
>>
>>
>>
>> > the API could simply accept "Wheel file + the Dag id"
>>
>> This sounds like a huge security risk.
>>
>>
>> My main concern with this idea is that it seems a lot of complexity we are putting
>> on users. Doubly so if they are already using docker, where there already exists an
>> ideal packaging and distribution format that could contain the dag + needed code.
>>
>> (Sorry for the brevity)
>>
>> -ash
>>
>>
>> On 2 August 2020 08:47:39 BST, Jarek Potiuk <Jarek.Potiuk@polidea.com> wrote:
>> >Few points from my side (and a proposal!):
>> >
>> >1) Agree with Max - with a rather strong NO for pickles (however, indeed
>> >cloudpickle solves some of the problems). Pickles came up in our discussion in
>> >Polidea recently and the overall message was "no". I agree with Max here - if we
>> >can ship python code, turning that into pickle for transit makes little sense to
>> >me and brings a plethora of problems.
>> >
>> >2) I think indeed the versioning solution should treat the "DagRun" structure
>> >atomically. While I see why we would like to go with the UI/Scheduler only first
>> >rather than implementing this in the workers, adding the "mixed version" is where
>> >it breaks down IMHO. Reasoning about such a "mixed version" dag is next to
>> >impossible. The current behavior is not well defined and non-deterministic (it
>> >depends on scheduler delays, syncing, type of deployment, restarts of the workers
>> >etc.), and we are moving it up to the UI (thus to users) rather than solving the
>> >problem. So I am not a big fan of this and would rather solve it "well" with
>> >atomicity.
>> >
>> >3) I see the point of Dan as well - we had many discussions and many
>> >times the idea about "submitting" the DAG for execution via the API
>> >came up - and it makes sense IMHO.
>> >
>> >Proposal: Implement full versioning, with code shipped via wheel BLOBs in the DB
>> >(akin to serialized DAGs).
>> >
>> >I understand that the big issue is how to actually "ship" the code to the worker.
>> >And - maybe a wild idea - we can kill several birds with one stone.
>> >
>> >There were plenty of discussions on how we could do that but one was
>> >never truly explored - using wheel packages.
>> >
>> >For those who do not know them, there is the PEP:
>> >https://www.python.org/dev/peps/pep-0427/
>> >
>> >Wheels allow "packaging" python code in a standard way. They are portable
>> >("purelib" + they contain .py rather than .pyc code), they have metadata and
>> >versioning information, they can be signed for security, and they can contain
>> >other packages or python code. Why don't we let the scheduler pack the
>> >fingerprinted version of the DAG into a .whl and store it as a blob in the DB
>> >next to the serialized form?
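
A minimal sketch of that packing step (illustrative only: the helper name is invented,
and a real wheel would also carry a RECORD file):

    import io
    import zipfile

    def pack_dag_as_wheel(dag_file: str, dag_id: str, version: str) -> bytes:
        """Pack a single DAG file into a minimal PEP 427-style wheel blob."""
        dist = f"{dag_id.replace('-', '_')}-{version}"
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as whl:
            # The DAG module itself ("purelib": plain .py, no .pyc).
            whl.write(dag_file, arcname=f"{dag_id}/dag.py")
            # Wheel metadata carries the fingerprinted version.
            whl.writestr(f"{dist}.dist-info/METADATA",
                         f"Metadata-Version: 2.1\nName: {dag_id}\nVersion: {version}\n")
            whl.writestr(f"{dist}.dist-info/WHEEL",
                         "Wheel-Version: 1.0\nRoot-Is-Purelib: true\nTag: py3-none-any\n")
        return buf.getvalue()  # the bytes blob to store in the DB
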
>> >
>> >There were concerns about the size of the code to keep in the DB - but we already
>> >use the DB for serialized DAGs and it works fine (I believe we only need to add
>> >compression of the JSON serialized form - as we've learned from Airbnb during
>> >their talk at the Airflow Summit - wheels are already compressed). Also - each
>> >task will only need one particular "version" of one DAG, so even if we keep many
>> >of them in the DB, the old versions will pretty soon go "cold" and will never be
>> >retrieved (and most DBs will handle that well with caching/indexes).
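
A sketch of the compression idea (zlib over the JSON form; the function names are
illustrative, not Airflow API):

    import json
    import zlib

    def compress_serialized_dag(serialized_dag: dict) -> bytes:
        # JSON first, then DEFLATE. Wheels are already zip-compressed,
        # so only the serialized-DAG column would need this.
        return zlib.compress(json.dumps(serialized_dag).encode("utf-8"), 9)

    def decompress_serialized_dag(blob: bytes) -> dict:
        return json.loads(zlib.decompress(blob).decode("utf-8"))
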
>> >
>> >And if we want to add "callables" from other files - there is nothing to stop the
>> >person who defines the dag from adding a list of files that should be packaged
>> >together with the main DAG file (additional_python_files =
>> >["common/my_fantastic_library.py"] in the DAG constructor). Or we could auto-add
>> >all files after the DAG gets imported (i.e. automatically package all files that
>> >are imported for that particular DAG from the "dags" folder). That should be
>> >rather easy.
>> >
>> >This way we could ship the code to workers for the exact version that the DagRun
>> >uses. The wheels can be cached and unpacked/installed into a virtualenv for the
>> >execution of that single task. That should be super quick. Such a virtualenv can
>> >be wiped out after execution.
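
That per-task flow might look like this (a sketch with assumed names; the
"airflow tasks run" invocation is illustrative only, and the venv would also need
airflow itself installed):

    import os
    import shutil
    import subprocess
    import sys
    import tempfile

    def run_task_from_wheel(wheel_path, dag_id, task_id, execution_date):
        env_dir = tempfile.mkdtemp(prefix="airflow-task-venv-")
        try:
            subprocess.run([sys.executable, "-m", "venv", env_dir], check=True)
            pip = os.path.join(env_dir, "bin", "pip")  # POSIX layout assumed
            subprocess.run([pip, "install", wheel_path], check=True)
            python = os.path.join(env_dir, "bin", "python")
            subprocess.run([python, "-m", "airflow", "tasks", "run",
                            dag_id, task_id, execution_date], check=True)
        finally:
            shutil.rmtree(env_dir)  # wipe the virtualenv after execution
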
>> >
>> >Then we get what Max wants (atomicity of DagRuns) and what Dan wants (the API
>> >could simply accept "Wheel file + the Dag id"). We have isolation between tasks
>> >running on the same worker (based on virtualenvs) so that each process in the
>> >same worker can run a different version of the same Dag. And we have much less
>> >confusion in the UI.
>> >
>> >Extra bonus 1: we can expand it to package different dependencies in
>> >the wheels as well - so that if an operator requires a different
>> >(newer) version of a python library, it could be packaged together
>> >with the DAG in the same .whl file. This is also a highly requested
>> >feature.
>> >Extra bonus 2: workers will stop depending on the DAG file mount (!)
>> >which was our long term goal and indeed as Dan mentioned - a great
>> >step towards multi-tenancy.
>> >
>> >J.
>> >
>> >
>> >On Fri, Jul 31, 2020 at 6:41 AM Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
>> >>
>> >> Having tried it early on, I'd advocate pretty strongly against pickles and
>> >> would rather not get too deep into the why here. Short story is they can pull
>> >> in the entire memory space or much more than you want, and it's impossible to
>> >> reason about where they end. For that reason and other reasons, they're a
>> >> security issue. Oh, and some objects are not picklable (Jinja templates! to
>> >> name a problematic one...). I've also seen secret-related classes that raise
>> >> when pickled (thank god!).
>> >>
>> >> About callbacks and other things like that, it's quite a puzzle in python. One
>> >> solution would be to point to a python namespace,
>> >> callback="preset.airflow_utils.slack_callback", and assume the function has to
>> >> exist in the remote interpreter. Personally I like the DagFetcher idea (it
>> >> would be great to get a pointer to that mailing list thread here),
>> >> specifically the GitDagFetcher. I don't know how [un]reasonable it is, but I
>> >> hate pickles so much that shipping source code around seems much more
>> >> reasonable to me. I think there's a talk out there from Mike Star about
>> >> Dataswarm at FB and he may mention how their workers may git shallow clone the
>> >> pipeline repo. Or maybe they use that "beautifully ugly" hack of a gitfs fuse
>> >> [file system in user space] on the worker [could get deeper into that, not
>> >> sure how reasonable that is either].
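
Resolving such a dotted path on the worker could be as simple as (a sketch; the
example path is the one from the message):

    from importlib import import_module

    def resolve_callback(dotted_path):
        """Resolve "pkg.module.func" to a function in this interpreter."""
        module_path, _, attr = dotted_path.rpartition(".")
        func = getattr(import_module(module_path), attr)
        if not callable(func):
            raise TypeError(f"{dotted_path} does not point at a callable")
        return func

    # resolve_callback("preset.airflow_utils.slack_callback")
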
>> >>
>> >> About fingerprints, a simple `start_date = datetime.now() - timedelta(1)` may
>> >> lead to a never-repeating fingerprint. From memory, the spec doesn't list out
>> >> the properties considered to build the hash. It'd be helpful to specify and
>> >> review that list.
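
To illustrate the failure mode (a sketch; this hashing scheme is assumed, not the
AIP's):

    import hashlib
    import json
    from datetime import datetime, timedelta

    def fingerprint(serialized_dag: dict) -> str:
        payload = json.dumps(serialized_dag, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    # A dynamic start_date changes the serialized form on every parse,
    # so the fingerprint (and thus the "version") almost never repeats:
    f1 = fingerprint({"dag_id": "d", "start_date": datetime.now() - timedelta(1)})
    f2 = fingerprint({"dag_id": "d", "start_date": datetime.now() - timedelta(1)})
    assert f1 != f2
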
>> >>
>> >> Max
>> >>
>> >> On Wed, Jul 29, 2020 at 5:20 AM Kaxil Naik <kaxilnaik@gmail.com> wrote:
>> >>
>> >> > Thanks, both Max and Dan, for your comments - please check my reply below:
>> >> >
>> >> >
>> >> > > Personally I vote for a DAG version to be pinned and consistent for the
>> >> > > duration of the DAG run. Some of the reasons why:
>> >> > > - it's easier to reason about, and therefore visualize and troubleshoot
>> >> > > - it prevents some cases where dependencies are never met
>> >> > > - it prevents the explosion of artifact/metadata (one serialization per
>> >> > > dagrun as opposed to one per scheduler cycle) in the case of a dynamic DAG
>> >> > > whose fingerprint is never the same.
>> >> >
>> >> >
>> >> > In this AIP, we were only looking to fix the current "viewing behaviour";
>> >> > we were intentionally not changing the execution behaviour. The change you
>> >> > are suggesting means we need to introduce DAG Versioning for the workers
>> >> > too. This will need more work, as we can't use the Serialised
>> >> > Representation to run the task since users could use custom modules in a
>> >> > different part of the code; for example, the PythonOperator has
>> >> > python_callable, which allows running any arbitrary code.
>> >> > A similar case is with the *on_*_callbacks* defined on the DAG.
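
To make the problem concrete, an illustrative DAG (Airflow 1.10-era imports; the
custom module is hypothetical, borrowed from Jarek's example) whose callable lives
outside anything the serialized JSON captures:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical custom module under the dags folder - the serialized
    # representation records only the task, not this code.
    from common.my_fantastic_library import do_the_work

    with DAG("needs_source_files", start_date=datetime(2020, 7, 1),
             schedule_interval="@daily") as dag:
        run_it = PythonOperator(task_id="run_it", python_callable=do_the_work)
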
>> >> >
>> >> > Based on the current scope of the AIP, we still plan to use the actual DAG
>> >> > files for execution, and not to use Serialized DAGs for the workers.
>> >> >
>> >> > To account for all the custom modules, we will have to start looking at
>> >> > pickle (cloudpickle).
>> >> >
>> >> > > I'm certain that there are lots of those DAGs out there, and that it will
>> >> > > overwhelm the metadata database and confuse the users. For an hourly DAG
>> >> > > it would mean 24 artifacts per day instead of 1000+.
>> >> >
>> >> >
>> >> > What kind of dynamic DAGs are we talking about here? I would think the DAG
>> >> > signature won't change, but I might be wrong - can you give an example,
>> >> > please?
>> >> >
>> >> > > If backwards compatibility in behavior is a concern, I'd recommend adding
>> >> > > a flag to the DAG class and/or config and make sure we're doing the right
>> >> > > thing by default. People who want backward compatibility would have to
>> >> > > change that default. But again, that's a lot of extra and confusing
>> >> > > complexity that will likely be the source of bugs and user confusion.
>> >> > > Having a clear, easy to reason about execution model is super important.
>> >> > >
>> >> > > Think about visualizing a DAG that shapeshifted 5 times during its
>> >> > > execution, how does anyone make sense of that?
>> >> >
>> >> >
>> >> > Wouldn't that be an edge case? How often would someone change the DAG
>> >> > structure in the middle of a DAG execution? And if they do change it, the
>> >> > Graph View should show all the tasks that were run; if it just shows based
>> >> > on the latest version, the behaviour would be the same as now.
>> >> >
>> >> > --------
>> >> >
>> >> > > Strongly agree with Max's points, also I feel the right way to go about
>> >> > > this is instead of Airflow schedulers/webservers/workers reading DAG
>> >> > > Python files, they would instead read from serialized representations of
>> >> > > the DAGs (e.g. json representation in the Airflow DB). Instead of DAG
>> >> > > owners pushing their DAG files to the Airflow components via varying
>> >> > > mechanisms (e.g. git), they would instead call an Airflow CLI to push the
>> >> > > serialized DAG representations to the DB, and for things like dynamic
>> >> > > DAGs you could populate them from a DAG or another service.
>> >> >
>> >> >
>> >> > The Airflow Webserver and the Scheduler will definitely read from the
>> >> > Serialized representation, as they don't need all the code from the DAG
>> >> > files.
>> >> >
>> >> > The workers, however, definitely need access to DAG files, as the
>> >> > tasks/operators would be using code from custom modules and classes which
>> >> > are required to run the tasks.
>> >> >
>> >> > If we do want to go down that route, we will have to use something like
>> >> > cloudpickle, which serializes the entire DAG file and its dependencies. We
>> >> > would also have to ensure that someone is not able to change the pickled
>> >> > source when it is sent from the executor to the worker, as that poses a big
>> >> > security risk.
>> >> >
>> >> > - Kaxil
>> >> >
>> >> > On Wed, Jul 29, 2020 at 12:43 PM Jacob Ward <jward@brandwatch.com> wrote:
>> >> >
>> >> > > I came here to say what Max has said, only less eloquently.
>> >> > >
>> >> > > I do have one concern with locking the version for a single run.
>> >> > > Currently it is possible for a user to create a dag which intentionally
>> >> > > changes as it executes, i.e. dynamically creating a task for the dag
>> >> > > during a run by modifying external data, but this change would prevent
>> >> > > that. I'm of the opinion that this situation is bad practice anyway, so
>> >> > > it doesn't matter if we make it impossible to do, but others may
>> >> > > disagree.
>> >> > >
>> >> > > On Tue, 28 Jul 2020 at 17:08, Dan Davydov <ddavydov@twitter.com.invalid> wrote:
>> >> > >
>> >> > > > Strongly agree with Max's points, also I feel the right way to go
>> >> > > > about this is instead of Airflow schedulers/webservers/workers reading
>> >> > > > DAG Python files, they would instead read from serialized
>> >> > > > representations of the DAGs (e.g. json representation in the Airflow
>> >> > > > DB). Instead of DAG owners pushing their DAG files to the Airflow
>> >> > > > components via varying mechanisms (e.g. git), they would instead call
>> >> > > > an Airflow CLI to push the serialized DAG representations to the DB,
>> >> > > > and for things like dynamic DAGs you could populate them from a DAG or
>> >> > > > another service.
>> >> > > >
>> >> > > > This would also enable other features like stronger
>> >> > > > security/multi-tenancy.
>> >> > > >
>> >> > > > On Tue, Jul 28, 2020 at 6:44 PM Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
>> >> > > >
>> >> > > > > > "mixed version"
>> >> > > > >
>> >> > > > > Personally I vote for a DAG version to be pinned and consistent for
>> >> > > > > the duration of the DAG run. Some of the reasons why:
>> >> > > > > - it's easier to reason about, and therefore visualize and troubleshoot
>> >> > > > > - it prevents some cases where dependencies are never met
>> >> > > > > - it prevents the explosion of artifact/metadata (one serialization
>> >> > > > > per dagrun as opposed to one per scheduler cycle) in the case of a
>> >> > > > > dynamic DAG whose fingerprint is never the same. I'm certain that
>> >> > > > > there are lots of those DAGs out there, and that it will overwhelm
>> >> > > > > the metadata database and confuse the users. For an hourly DAG it
>> >> > > > > would mean 24 artifacts per day instead of 1000+.
>> >> > > > >
>> >> > > > > If backwards compatibility in behavior is a concern, I'd recommend
>> >> > > > > adding a flag to the DAG class and/or config and make sure we're
>> >> > > > > doing the right thing by default. People who want backward
>> >> > > > > compatibility would have to change that default. But again, that's a
>> >> > > > > lot of extra and confusing complexity that will likely be the source
>> >> > > > > of bugs and user confusion. Having a clear, easy to reason about
>> >> > > > > execution model is super important.
>> >> > > > >
>> >> > > > > Think about visualizing a DAG that shapeshifted 5 times during its
>> >> > > > > execution, how does anyone make sense of that?
>> >> > > > >
>> >> > > > > Max
>> >> > > > >
>> >> > > > > On Tue, Jul 28, 2020 at 3:14 AM Kaxil Naik <kaxilnaik@gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Thanks Max for your comments.
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > > *DAG Fingerprinting: *this can be tricky, especially in regards
>> >> > > > > > > to dynamic DAGs, where in some cases each parsing of the DAG can
>> >> > > > > > > result in a different fingerprint. I think DAG and tasks
>> >> > > > > > > attributes are left out from the proposal that should be
>> >> > > > > > > considered as part of the fingerprint, like trigger rules or
>> >> > > > > > > task start/end datetime. We should do a full pass of all DAG
>> >> > > > > > > arguments and make sure we're not forgetting anything that can
>> >> > > > > > > change scheduling logic. Also, let's be careful that something
>> >> > > > > > > as simple as a dynamic start or end date on a task could lead to
>> >> > > > > > > a different version each time you parse.
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > The short version of DAG Fingerprinting would be just a hash of
>> >> > > > > > the Serialized DAG.
>> >> > > > > >
>> >> > > > > > *Example DAG*: https://imgur.com/TVuoN3p
>> >> > > > > > *Example Serialized DAG*: https://imgur.com/LmA2Bpr
>> >> > > > > >
>> >> > > > > > It contains all the task & DAG parameters. When they change, the
>> >> > > > > > Scheduler writes a new version of the Serialized DAG to the DB.
>> >> > > > > > The Webserver then reads the DAGs from the DB.
>> >> > > > > >
>> >> > > > > > > I'd recommend limiting serialization/storage of one version per
>> >> > > > > > > DAG Run, as opposed to potentially every time the DAG is parsed
>> >> > > > > > > - once the version for a DAG run is pinned, fingerprinting is
>> >> > > > > > > not re-evaluated until the next DAG run is ready to get created.
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > This is to handle Scenario 3, where a DAG structure is changed
>> >> > > > > > mid-way. Since we don't intend to change the execution behaviour,
>> >> > > > > > if we limit storage to 1 version per DAG Run, it won't actually
>> >> > > > > > show what was run.
>> >> > > > > >
>> >> > > > > > Example DAG v1: Task A -> Task B -> Task C
>> >> > > > > > The worker has completed the execution of Task B and is just about
>> >> > > > > > to start the execution of Task C.
>> >> > > > > >
>> >> > > > > > The 2nd version of the DAG is deployed: Task A -> Task D
>> >> > > > > > Now the Scheduler queues Task D and it will run to completion.
>> >> > > > > > (Task C won't run.)
>> >> > > > > >
>> >> > > > > > In this case, "the actual representation of the DAG" that ran is
>> >> > > > > > neither v1 nor v2 but a "mixed version" (Task A -> Task B ->
>> >> > > > > > Task D). The plan is that the Scheduler will create this "mixed
>> >> > > > > > version" based on what ran, and the Graph View would show this
>> >> > > > > > "mixed version".
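
The two versions in this example might look as follows in the DAG file
(illustrative only, with Airflow 1.10-era imports):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    with DAG("example", start_date=datetime(2020, 7, 1),
             schedule_interval="@daily") as dag:
        task_a = DummyOperator(task_id="task_a")
        task_b = DummyOperator(task_id="task_b")
        task_c = DummyOperator(task_id="task_c")
        task_a >> task_b >> task_c   # v1: A -> B -> C

    # v2 of the same file replaces the chain above with:
    #     task_a >> DummyOperator(task_id="task_d")   # A -> D
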
>> >> > > > > >
>> >> > > > > > There would also be a toggle button on the Graph View to select
>> >> > > > > > v1 or v2, where the tasks will be highlighted to show that a
>> >> > > > > > particular task was in v1 or v2, as shown in
>> >> > > > > > https://cwiki.apache.org/confluence/download/attachments/158868919/Picture%201.png?version=2&modificationDate=1595612863000&api=v2
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > > *Visualizing change in the tree view:* I think this is very
>> >> > > > > > > complex and many things can make this view impossible to render
>> >> > > > > > > (task dependency reversal, cycles across versions, ...). Maybe a
>> >> > > > > > > better visual approach would be to render independent,
>> >> > > > > > > individual tree views for each DAG version (side by side), and
>> >> > > > > > > doing best effort aligning the tasks across blocks and "linking"
>> >> > > > > > > tasks with lines across blocks when necessary.
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > Agreed, the plan is to do the best-effort aligning. At this point
>> >> > > > > > in time, task additions to the end of the DAG are expected to be
>> >> > > > > > compatible, but changes to task structure within the DAG may cause
>> >> > > > > > the tree view not to incorporate “old” and “new” in the same view,
>> >> > > > > > hence that won't be shown.
>> >> > > > > >
>> >> > > > > > Regards,
>> >> > > > > > Kaxil
>> >> > > > > >
>> >> > > > > > On Mon, Jul 27, 2020 at 6:02 PM Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > > Some notes and ideas:
>> >> > > > > > >
>> >> > > > > > > *DAG Fingerprinting: *this can be tricky, especially in regards
>> >> > > > > > > to dynamic DAGs, where in some cases each parsing of the DAG can
>> >> > > > > > > result in a different fingerprint. I think DAG and tasks
>> >> > > > > > > attributes are left out from the proposal that should be
>> >> > > > > > > considered as part of the fingerprint, like trigger rules or
>> >> > > > > > > task start/end datetime. We should do a full pass of all DAG
>> >> > > > > > > arguments and make sure we're not forgetting anything that can
>> >> > > > > > > change scheduling logic. Also, let's be careful that something
>> >> > > > > > > as simple as a dynamic start or end date on a task could lead to
>> >> > > > > > > a different version each time you parse. I'd recommend limiting
>> >> > > > > > > serialization/storage of one version per DAG Run, as opposed to
>> >> > > > > > > potentially every time the DAG is parsed - once the version for
>> >> > > > > > > a DAG run is pinned, fingerprinting is not re-evaluated until
>> >> > > > > > > the next DAG run is ready to get created.
>> >> > > > > > >
>> >> > > > > > > *Visualizing change in the tree view:* I think this is very
>> >> > > > > > > complex and many things can make this view impossible to render
>> >> > > > > > > (task dependency reversal, cycles across versions, ...). Maybe a
>> >> > > > > > > better visual approach would be to render independent,
>> >> > > > > > > individual tree views for each DAG version (side by side), and
>> >> > > > > > > doing best effort aligning the tasks across blocks and "linking"
>> >> > > > > > > tasks with lines across blocks when necessary.
>> >> > > > > > >
>> >> > > > > > > On Fri, Jul 24, 2020 at 12:46 PM Vikram Koka <vikram@astronomer.io> wrote:
>> >> > > > > > >
>> >> > > > > > > > Team,
>> >> > > > > > > >
>> >> > > > > > > > We just created 'AIP-36 DAG Versioning' on Confluence and
>> >> > > > > > > > would very much appreciate feedback and suggestions from the
>> >> > > > > > > > community:
>> >> > > > > > > >
>> >> > > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
>> >> > > > > > > >
>> >> > > > > > > > The DAG Versioning concept has been discussed on multiple
>> >> > > > > > > > occasions in the past and has been a topic highlighted as part
>> >> > > > > > > > of Airflow 2.0 as well. We at Astronomer have heard data
>> >> > > > > > > > engineers at several enterprises ask about this feature as
>> >> > > > > > > > well, for easier debugging when changes are made to DAGs as a
>> >> > > > > > > > result of evolving business needs.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > As described in the AIP, we have a proposal focused on
>> >> > > > > > > > ensuring that the visibility behaviour of Airflow is correct,
>> >> > > > > > > > without changing the execution behaviour. We considered
>> >> > > > > > > > changing the execution behaviour as well, but decided that the
>> >> > > > > > > > risks in changing execution behavior were too high as compared
>> >> > > > > > > > to the benefits, and therefore decided to limit the scope to
>> >> > > > > > > > only making sure that the visibility was correct.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > We would like to attempt this based on our experience running
>> >> > > > > > > > Airflow as a service. We believe that this benefits Airflow as
>> >> > > > > > > > a project and the development experience of data engineers
>> >> > > > > > > > using Airflow across the world.
>> >> > > > > > > >
>> >> > > > > > > > Any feedback, suggestions, and comments would be greatly
>> >> > > > > > > > appreciated.
>> >> > > > > > > >
>> >> > > > > > > > Best Regards,
>> >> > > > > > > >
>> >> > > > > > > > Kaxil Naik, Ryan Hamilton, Ash Berlin-Taylor, and Vikram Koka
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > >
>> >> > > Jacob Ward    |    Graduate Data Infrastructure Engineer
>> >> > >
>> >> > > jward@brandwatch.com
>> >> > >
>> >> > >
>> >> > > NEW YORK | BOSTON | BRIGHTON | LONDON | BERLIN | STUTTGART | PARIS | SINGAPORE | SYDNEY
>> >> > >
>> >> >
>> >
>> >
>> >
>> >--
>> >
>> >Jarek Potiuk
>> >Polidea | Principal Software Engineer
>> >
>> >M: +48 660 796 129
