airflow-dev mailing list archives

From Jarek Potiuk <Jarek.Pot...@polidea.com>
Subject Re: Generic Transfer Operator
Date Wed, 19 Aug 2020 18:10:30 GMT
I like the idea a lot. Similar things have been discussed before, but the
proposal is, I think, rather pragmatic and solves a real problem (and it does
not seem too complex to implement).

There is already some discussion about it in the document (those interested,
please chime in), but here are a few points on why I like it:

- performance and optimization are not the focus. For generic stuff it is
usually tempting to write an "optimal" solution, but once you admit you are
not going to focus on optimisation, you come up with simpler and
easier-to-use solutions

- on the other hand, it takes a very Pythonic approach, using Airflow's
familiar concepts (connection, transfer), and has the potential to plug
easily into the hundreds of hooks we already have, leveraging all the
richness of Airflow's "providers".

- it aims to make a "quick start" easy: if you have a number of different
sources/targets and, as a data scientist, you would like to quickly start
transferring data between them, you can do it easily with only basic Python
knowledge and a simple DAG structure.

- it should be possible to plug it into our new functional approach as well
as into future lineage discussions, as it establishes a connection between
sources and targets

- it opens up the possibility of adding simple and flexible data
transformation on transfer. This is not a replacement for any of the external
services that Airflow should use (Airflow is an orchestrator, not a data
processing solution), but for the kind of quick-start scenarios where I
foresee it being most useful, the ability for a data scientist to apply
simple data transformations on the fly might be a big plus.

Suggestion: use a pandas DataFrame as the format of the "data" component.
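
To make the suggestion a bit more concrete, here is a rough sketch of what a
DataFrame-based transfer could look like. It is only an illustration, not the
design from the document: the names Source, Target, CsvSource, SqliteTarget
and transfer are all hypothetical, it uses plain pandas and sqlite3 instead of
real Airflow hooks, and it assumes pandas is installed.

```python
import io
import sqlite3
from typing import Callable, Optional, Protocol

import pandas as pd


class Source(Protocol):
    """Hypothetical interface: anything that can produce a DataFrame."""

    def read(self) -> pd.DataFrame: ...


class Target(Protocol):
    """Hypothetical interface: anything that can consume a DataFrame."""

    def write(self, data: pd.DataFrame) -> None: ...


class CsvSource:
    """Illustrative source that reads CSV text from a file-like object."""

    def __init__(self, buffer: io.StringIO) -> None:
        self.buffer = buffer

    def read(self) -> pd.DataFrame:
        return pd.read_csv(self.buffer)


class SqliteTarget:
    """Illustrative target that writes the DataFrame to a SQLite table."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self.conn = conn
        self.table = table

    def write(self, data: pd.DataFrame) -> None:
        # Replace the table on every run to keep the sketch simple.
        data.to_sql(self.table, self.conn, if_exists="replace", index=False)


def transfer(
    source: Source,
    target: Target,
    transform: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
) -> int:
    """Move data from source to target, optionally transforming it on the way.

    Returns the number of rows written.
    """
    data = source.read()
    if transform is not None:
        data = transform(data)
    target.write(data)
    return len(data)


# Usage: copy CSV rows into SQLite, keeping only positive amounts.
conn = sqlite3.connect(":memory:")
rows = transfer(
    CsvSource(io.StringIO("name,amount\nalice,10\nbob,-3\n")),
    SqliteTarget(conn, "payments"),
    transform=lambda df: df[df["amount"] > 0],
)
stored = conn.execute("SELECT name, amount FROM payments").fetchall()
```

With the DataFrame as the common format, every source only needs a "read"
and every target only needs a "write", and the optional transform is just a
function from DataFrame to DataFrame. Everything is held in memory, which
fits the "not focused on optimisation" point above.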

Kamil - you should have access now.

J.


On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <kamil.olszewski@polidea.com>
wrote:

> Hello all,
> in Polidea we have come up with an idea for a generic transfer operator
> that would be able to transport data between two destinations of various
> types (file, database, storage, etc.) - please find the link with a short
> doc with POC
> <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>
> where we can discuss the design initially. Once we come to the initial
> conclusion I can create an AIP on cWiki - can I ask for permission to do so
> (my id is 'kamil.olszewski')? I believe that during the discussion we
> should definitely aim for this feature to be released only after Airflow
> 2.0 is out.
>
> What do you think about this idea? Would you find such an operator helpful
> in your pipelines? Maybe you already use a similar solution or know
> packages that could be used to implement it?
>
> Best regards,
> --
>
> Kamil Olszewski
> Polidea <https://www.polidea.com> | Software Engineer
>
> M: +48 503 361 783
> E: kamil.olszewski@polidea.com
>
> Unique Tech
> Check out our projects! <https://www.polidea.com/our-work>
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
