nifi-dev mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Multiple dataflows with sub-flows and version control
Date Fri, 02 Jan 2015 19:25:49 GMT
Orrin,

You definitely bring up a good point.  I believe, though, that the point is
about the inherent complexity that exists when you have large-scale dataflows,
and a large number of them at that.

What NiFi allows you to do is manage the complexity visually, in real-time,
and all across the desired spectrum of granularity.  One potentially
convenient way to think about it is this:

When you're writing code and you identify a new abstraction that would make
things cleaner and more logical you start to refactor.  You do this to make
your code more elegant, more efficient, more maintainable and to manage
complexity.  In NiFi you do exactly that.  As you're growing toward
hundreds or thousands of processors you identify patterns that reveal
themselves visually.  That is a great way to communicate concepts not just
for the original author but for others as well.  As you build flows, bad
ideas tend to become obvious and, more importantly, easy to deal with.  The
key thing, though, is that you don't have long, arduous off-line improvement
cycles, which tend to cause folks to avoid solving the root problem and thus
accrue tech debt.  With NiFi you just start making improvements to the
flow while everything is running.  You get immediate feedback on whether
what you're doing is correct or not.  You can experiment in production but
outside the production flow if necessary by doing a super efficient tee of
the flow.  It really is a very different way of approaching a very old
problem.

It's cool that you're seeing ETL cases for it.  If there are details of
that which you can share we'd love to hear them.  I don't know if the sweet
spot is there or not.  We'll have to see what the community finds and how
that evolves over time.  I will say for new NiFi users it is extremely
common for them to think of a bunch of independent dataflow graphs which
are basically a lot of independent linear graphs.  Then over time as they
start to understand more about what it enables they start thinking in
directed graphs and how to merge flows and establish reusable components
and so on.  Curious to see how that maps to your experience.

As for the check-in of the flow configuration to a source control system,
you can certainly do that.  You could programmatically invoke our endpoint
that causes NiFi to make a backup of the flow, and then check that backup
into source control at some time interval.  But keep in mind that is just like taking a
picture of what the 'flow looks like'.  NiFi is more than the picture of
the flow.  It is the picture of the flow and the state of the data within
it and so on.
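
The periodic snapshot idea above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not NiFi's own mechanism: it
assumes the backup of the flow is already available as a file on disk (the
name flow.xml.gz used in the example is an assumption), and it copies that
file into a source-control working directory only when its content has
changed. The helper names snapshot_flow and run_periodically are hypothetical,
and the actual commit step is left as a comment.

```python
import hashlib
import shutil
import time
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_flow(flow_file: Path, repo_dir: Path) -> bool:
    """Copy flow_file into repo_dir if its content changed.

    Returns True when a new copy was written, False when the tracked
    copy was already identical. Both paths are assumptions supplied
    by the caller.
    """
    repo_dir.mkdir(parents=True, exist_ok=True)
    tracked = repo_dir / flow_file.name
    if tracked.exists() and file_digest(tracked) == file_digest(flow_file):
        return False
    shutil.copy2(flow_file, tracked)
    return True


def run_periodically(flow_file: Path, repo_dir: Path,
                     interval_s: float, rounds: int) -> int:
    """Take `rounds` snapshots, sleeping interval_s between them.

    Returns how many snapshots actually changed the tracked copy.
    """
    changed = 0
    for i in range(rounds):
        if snapshot_flow(flow_file, repo_dir):
            changed += 1
            # A real setup would commit the tracked file to source
            # control here (for example with the VCS command-line tool).
        if i < rounds - 1:
            time.sleep(interval_s)
    return changed
```

Note that this captures only the picture of the flow, not the state of the
data within it.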

Very interested to hear more of your thoughts as you look further and think
more about it.  You'll be a great help to us to better understand how to
communicate about it to folks coming from an ETL background.  Ultimately it
would be great if we could get you to help us do that ;-)   Please don't
be shy about letting us know your expectations.  We're new here too.

Thanks
Joe



On Fri, Jan 2, 2015 at 1:50 PM, Edenfield, Orrin <orrin.edenfield@prgx.com>
wrote:

> Mark,
>
> I follow the logic here; I just think over time it will be really hard to
> keep track of things when there are hundreds (or thousands) of processors -
> rather than hundreds of different flows (organized within a source control
> tree or similar) that all have 5-50 different processors within them.
>
> I'd be interested to learn about how the Process Groups component works so
> if you do get time to draw an example I think that would be helpful.
>
> Thank you.
>
> --
> Orrin Edenfield
>
> -----Original Message-----
> From: Mark Payne [mailto:markap14@hotmail.com]
> Sent: Friday, January 02, 2015 12:34 PM
> To: dev@nifi.incubator.apache.org; Edenfield, Orrin
> Subject: Re: Multiple dataflows with sub-flows and version control
>
> Orrin,
>
> Within NiFi you can create many different dataflows within the same graph
> and run them concurrently. We've built flows with several hundred
> Processors. The data can flow between flows by simply connecting the
> Processors together.
>
> If you want to separate the flows logically because it makes more sense to
> you to visualize them that way, you may want to use Process Groups.
>
> I'm on my cell phone right now so I cannot draw up an example for you but
> I will this afternoon when I have a chance. But the basic idea is that for
> #1 you would have:
>
> GetFile -> PutHDFS
>
> And alongside that, another GetFile -> CompressContent -> the same PutHDFS.
>
> In this case you can even do this with the following flow:
>
> GetFile -> IdentifyMimeType (to check if compressed) -> CompressContent
> (set to decompress, with the compression type coming from the MIME type,
> which is identified by the previous processor) -> PutHDFS
>
> With regards to #2:
> You can build the new flow right alongside the old flow. When you are
> ready to switch, simply change the connection to send data to the new flow
> instead of the old one.
>
> Again, I'll put together some examples this afternoon with screen shots
> that should help. Let me know if this helps or if it creates more questions
> (or both :))
>
> Thanks
> -Mark
>
>
>
> Sent from my iPhone
>
> > On Jan 2, 2015, at 11:37 AM, Edenfield, Orrin <orrin.edenfield@prgx.com>
> > wrote:
> >
> > Hello everyone - I'm new to the mailing list and I've tried to search
> > the JIRA and mailing list to see if this has already been addressed and
> > didn't find anything so here it goes:
> >
> > When I think about the capabilities of this tool I instantly think of
> > ETL-type tools. So the questions/comments below are likely to be coming
> > from that frame of mind - let me know if I've misunderstood a key concept
> > of NiFi as I think that could be possible.
> >
> > Is it possible to have NiFi service setup and running and allow for
> > multiple dataflows to be designed and deployed (running) at the same time?
> > So far in my testing I've found that I can get NiFi service up and
> > functioning as expected on my cluster edge node but I'd like to be able to
> > design multiple dataflows for the following reasons.
> >
> > 1. I have many datasets that will need some of the same flow actions but
> > not all of them. I'd like to componentize the flows and possibly have
> > multiple flows cascade from one to another. For example:  I will want all
> > data to flow into an HDFS endpoint but dataset1 will be coming in as
> > delimited data so it can go directly into the GetFile processor while I
> > need dataset2 to go through a CompressContent processor first.
> >
> > 2. Because I have a need in #1 above - I'd like to be able to design
> > multiple flows (specific to a data need or component flows that work
> > together) and have them all be able to be deployed (running) concurrently.
> >
> > Also - it would be nice to be able to version control these designed
> > flows so I can have 1 flow running while modifying a version 2.0 of that
> > flow and then once the updates have been made then I can safely and
> > effectively have a mechanism to shut down flow.v1 and start up flow.v2.
> >
> > Thank you.
> >
> > --
> > Orrin Edenfield
>
