nifi-dev mailing list archives

From "Edenfield, Orrin" <orrin.edenfi...@prgx.com>
Subject RE: Multiple dataflows with sub-flows and version control
Date Fri, 02 Jan 2015 20:02:08 GMT
Joe,

Thank you for taking the time to detail this out for me.  This is a different way of thinking
for me, but I think I'm starting to get it.  I work with a data factory that uses an ETL tool;
it would take about 300 to 600 individual flows (closer to the 300 side if we can parameterize/re-use
pieces of flows), and literally thousands of processors, if we solved it the same
way we solve it with traditional ETL tools.

I'll try to think on it some more over the weekend, but you're probably right that with full
use of these components it could be quickly compacted into a much smaller footprint in terms
of the actual data flow needed.

I know things are still getting started here with incubation, but if there are any documents
or further examples I can read up on for things like Process Groups, I think that would
help me fully wrap my head around applying this to my world.  :-)

And just let me know if there is anything I can do to help - I'm excited about the possibilities
of this tool!

Thank you.

--
Orrin Edenfield
Associate Architect - PRGX USA, Inc.
Orrin.Edenfield@prgx.com

-----Original Message-----
From: Joe Witt [mailto:joe.witt@gmail.com] 
Sent: Friday, January 02, 2015 2:26 PM
To: dev@nifi.incubator.apache.org
Subject: Re: Multiple dataflows with sub-flows and version control

Orrin,

You definitely bring up a good point.  I believe, though, that the point is really about the
inherent complexity that exists when you have large-scale dataflows, and a large number of them at that.

What NiFi allows you to do is manage the complexity visually, in real-time, and all across
the desired spectrum of granularity.  One potentially convenient way to think about it is
this:

When you're writing code and you identify a new abstraction that would make things cleaner
and more logical, you start to refactor.  You do this to make your code more elegant, more
efficient, and more maintainable, and to manage complexity.  In NiFi you do exactly that.  As you
grow toward hundreds or thousands of processors, you identify patterns that reveal themselves
visually.  That is a great way to communicate concepts, not just for the original author but
for others as well.  As you build flows, bad ideas tend to become obvious and, more importantly,
easy to deal with.  The key thing, though, is that you don't have long, arduous off-line improvement
cycles, which tend to cause folks to avoid solving the root problem and thus accrue tech
debt.  With NiFi you just start making improvements to the flow while everything is running.
 You get immediate feedback on whether what you're doing is correct or not.  You can experiment
in production, but outside the production flow if necessary, by doing a super-efficient tee
of the flow.  It really is a very different way of approaching a very old problem.

It's cool that you're seeing ETL cases for it.  If there are details of that which you can
share, we'd love to hear them.  I don't know if the sweet spot is there or not.  We'll have
to see what the community finds and how that evolves over time.  I will say that for new NiFi
users it is extremely common to start by thinking of a bunch of independent dataflow graphs, which are
basically a lot of independent linear graphs.  Then over time, as they start to understand
more about what it enables, they start thinking in directed graphs: how to merge flows,
establish reusable components, and so on.  Curious to see how that maps to your experience.

As for checking the flow configuration into a source control system, you can certainly
do that.  You could programmatically invoke our endpoint that causes NiFi to make a backup
of the flow, and then put that backup in source control on some time interval.  But keep in mind that
is just like taking a picture of what the flow 'looks like'.  NiFi is more than the picture
of the flow.  It is the picture of the flow plus the state of the data within it, and so on.
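
For example, since NiFi persists the running flow to conf/flow.xml.gz, a scheduled script
could snapshot that file into a git repository.  A minimal Python sketch (the paths and
interval below are just assumptions for illustration):

    import shutil
    import subprocess
    import time
    from datetime import datetime, timezone

    # Assumed locations - adjust for your install.  flow.xml.gz is where
    # NiFi persists the current flow configuration.
    FLOW = "/opt/nifi/conf/flow.xml.gz"
    REPO = "/opt/flow-backups"   # an existing git repository
    INTERVAL_SECONDS = 3600      # snapshot once an hour

    while True:
        shutil.copy(FLOW, REPO + "/flow.xml.gz")
        stamp = datetime.now(timezone.utc).isoformat()
        subprocess.run(["git", "-C", REPO, "add", "flow.xml.gz"], check=True)
        # git commit exits non-zero when nothing changed, so no check=True here.
        subprocess.run(["git", "-C", REPO, "commit", "-m", "flow snapshot " + stamp])
        time.sleep(INTERVAL_SECONDS)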

Very interested to hear more of your thoughts as you look further and think more about it.
 You'll be a great help to us in better understanding how to communicate about it to folks coming
from an ETL background.  Ultimately it would be great if we could get you to help us do that ;-)   Please don't
be shy about letting us know your expectations.  We're new here too.

Thanks
Joe



On Fri, Jan 2, 2015 at 1:50 PM, Edenfield, Orrin <orrin.edenfield@prgx.com> wrote:

> Mark,
>
> I follow the logic here; I just think that over time it will be really
> hard to keep track of things when there are hundreds (or thousands) of
> processors - rather than hundreds of different flows (organized within
> a source control tree or similar) that each have 5-50 different processors within them.
>
> I'd be interested to learn how the Process Groups component
> works, so if you do get time to draw up an example I think that would be helpful.
>
> Thank you.
>
> --
> Orrin Edenfield
>
> -----Original Message-----
> From: Mark Payne [mailto:markap14@hotmail.com]
> Sent: Friday, January 02, 2015 12:34 PM
> To: dev@nifi.incubator.apache.org; Edenfield, Orrin
> Subject: Re: Multiple dataflows with sub-flows and version control
>
> Orrin,
>
> Within NiFi you can create many different dataflows within the same
> graph and run them concurrently. We've built flows with several
> hundred Processors. Data can flow between flows by simply
> connecting the Processors together.
>
> If you want to separate the flows logically because it makes more 
> sense to you to visualize them that way, you may want to use Process Groups.
>
> I'm on my cell phone right now so I cannot draw up an example for you,
> but I will this afternoon when I have a chance. But the basic idea is
> that for #1 you would have:
>
> GetFile -> PutHDFS
>
> And alongside that, another GetFile -> CompressContent -> the same PutHDFS.
>
> In this case you can even do this with the following flow:
>
> GetFile -> IdentifyMimeType (to check whether the content is compressed) ->
> CompressContent (set to decompress, with the compression type coming from
> the mime type identified by the previous processor) -> PutHDFS
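>
> Purely to illustrate the logic of that combined flow (this is not NiFi
> code - just a minimal Python sketch, assuming gzip as the one compressed
> format; the local write at the end stands in for the PutHDFS step):
>
>     import gzip
>     import shutil
>     from pathlib import Path
>
>     def ingest(path: Path, dest_dir: Path) -> None:
>         # Like IdentifyMimeType: sniff the two gzip magic bytes.
>         with path.open("rb") as f:
>             is_gzip = f.read(2) == b"\x1f\x8b"
>         if is_gzip:
>             # Like CompressContent in decompress mode.
>             target = dest_dir / path.stem   # drop the .gz suffix
>             with gzip.open(path, "rb") as src, target.open("wb") as out:
>                 shutil.copyfileobj(src, out)
>         else:
>             # Uncompressed content passes straight through.
>             shutil.copy(path, dest_dir / path.name)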
>
> With regards to #2:
> You can build the new flow right alongside the old flow. When you are
> ready to switch, simply change the connection to send data to the new
> flow instead of the old one.
>
> Again, I'll put together some examples this afternoon with screen 
> shots that should help. Let me know if this helps or if it creates 
> more questions (or both :))
>
> Thanks
> -Mark
>
>
>
> Sent from my iPhone
>
> > On Jan 2, 2015, at 11:37 AM, Edenfield, Orrin <orrin.edenfield@prgx.com> wrote:
> >
> > Hello everyone - I'm new to the mailing list.  I've tried to search
> > the JIRA and the mailing list to see if this has already been addressed
> > and didn't find anything, so here it goes:
> >
> > When I think about the capabilities of this tool I instantly think of
> > ETL-type tools, so the questions/comments below are likely to be
> > coming from that frame of mind - let me know if I've misunderstood a
> > key concept of NiFi, as I think that could be possible.
> >
> > Is it possible to have the NiFi service set up and running and allow for
> > multiple dataflows to be designed and deployed (running) at the same time?
> > So far in my testing I've found that I can get the NiFi service up and
> > functioning as expected on my cluster edge node, but I'd like to be
> > able to design multiple dataflows for the following reasons.
> >
> > 1. I have many datasets that will need some of the same flow actions,
> > but not all of them. I'd like to componentize the flows and possibly have
> > multiple flows cascade from one to another. For example: I will want
> > all data to flow into an HDFS endpoint, but dataset1 will be coming in
> > as delimited data so it can go directly into the GetFile processor,
> > while I need dataset2 to go through a CompressContent processor first.
> >
> > 2. Because of the need in #1 above, I'd like to be able to design
> > multiple flows (specific to a data need, or component flows that work
> > together) and have them all be able to be deployed (running) concurrently.
> >
> > Also - it would be nice to be able to version control these designed
> > flows so I can have 1 flow running while modifying a version 2.0 of
> > that flow, and then once the updates have been made I can safely
> > and effectively have a mechanism to shut down flow.v1 and start up flow.v2.
> >
> > Thank you.
> >
> > --
> > Orrin Edenfield
>