nifi-dev mailing list archives

From "Edenfield, Orrin" <orrin.edenfi...@prgx.com>
Subject Re: Multiple dataflows with sub-flows and version control
Date Fri, 02 Jan 2015 21:37:40 GMT
Mark, Joe, & Joe - Thank you for the info and examples. I will still need to re-read
and digest some of this, but it really helps me get a better idea of what is possible here.


I will keep thinking and email again if I have any other thoughts/questions/ideas. 

Really cool stuff here!

Orrin Edenfield 

> On Jan 2, 2015, at 3:33 PM, Joe Gresock <jgresock@gmail.com> wrote:
> 
> Orrin,
> 
> I've also found that a great deal of "individual data flows" can often be
> handled using some basic RouteOnAttribute and UpdateAttribute patterns in
> NiFi.  These two processors alone are extremely powerful in reducing flow
> sizes.  Joe W speaks the truth when he talks about being able to visualize
> bad patterns, especially in code reuse.  My team has found that often what
> appears to be multiple different flows turns out to be slightly different
> uses of the same basic flow, and we've been able to reduce the number of
> processors by orders of magnitude with careful study.  I wouldn't be
> surprised to find that your 300-processor use case can be reduced as well,
> but 300 is actually quite manageable in NiFi with Process Groups, coupled
> with the search bar feature (upper right).
> 
> I think Joe W said it best, but I just wanted to confirm that flow
> reduction is really something that happens in practice.
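
To illustrate the pattern described above, here is a minimal sketch in plain Python. It is not NiFi code and does not use NiFi's API or expression language. Flow files are modeled only as attribute dictionaries, and the function names, the attribute names (dataset, format, target.dir), and the first-match routing rule are all assumptions made for illustration. The point is that one shared flow, parameterized by attributes, can replace several near-identical flows.

```python
from typing import Callable, Dict, List, Tuple

# A flow file is modeled only by its attributes (hypothetical simplification).
Attributes = Dict[str, str]
# A routing rule pairs a relationship name with a predicate over attributes.
Rule = Tuple[str, Callable[[Attributes], bool]]


def update_attribute(attrs: Attributes, updates: Attributes) -> Attributes:
    """Return a copy of attrs with the given attributes added or overwritten."""
    merged = dict(attrs)
    merged.update(updates)
    return merged


def route_on_attribute(attrs: Attributes, rules: List[Rule],
                       default: str = "unmatched") -> str:
    """Return the name of the first rule whose predicate matches."""
    for name, predicate in rules:
        if predicate(attrs):
            return name
    return default


# One rule set shared by every dataset, instead of one flow per dataset.
RULES: List[Rule] = [
    ("decompress", lambda a: a.get("format") == "gzip"),
    ("direct", lambda a: a.get("format") == "delimited"),
]


def shared_flow(attrs: Attributes) -> Tuple[str, Attributes]:
    """Tag the flow file with a per-dataset target, then pick a route."""
    tagged = update_attribute(attrs, {"target.dir": "/data/" + attrs["dataset"]})
    return route_on_attribute(tagged, RULES), tagged


route1, out1 = shared_flow({"dataset": "dataset1", "format": "delimited"})
route2, out2 = shared_flow({"dataset": "dataset2", "format": "gzip"})
```

Adding a third dataset in this model means supplying different attributes, not building a third flow, which is the kind of reduction the message describes.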
> 
>> On Fri, Jan 2, 2015 at 3:05 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> 
>> Great!
>> 
>> "more documents ..."
>> 
>> Oh yeah - we're working hard on that too.  The current user guide draft can
>> be found here:
>> 
>> http://nifi.incubator.apache.org/docs/nifi-docs/user-guide.html
>> 
>> And if you build the latest 'develop' branch, that guide is integrated into the
>> app, along with the initial stab at our expression language.  We're still a
>> ways behind the ball on docs though and it is a major focus area.  We will
>> get examples out as well.  That is one really nice thing about our
>> 'Templates' feature.  Examples can be easily imported too.
>> 
>> Thanks and have a great weekend
>> 
>> Joe
>> 
>> On Fri, Jan 2, 2015 at 3:02 PM, Edenfield, Orrin <orrin.edenfield@prgx.com> wrote:
>> 
>>> Joe,
>>> 
>>> Thank you for taking the time to detail this out for me.  This is a
>>> different way of thinking for me but I think I'm starting to get it.  I
>>> work with a data factory that uses an ETL tool that would take about 300 to
>>> 600 individual flows (closer to the 300 side if we can parameterize/re-use
>>> pieces of flows) and would literally be thousands of processors - if we
>>> solved it the same way we solve with traditional ETL tools.
>>> 
>>> I'll try to think some more over the weekend but you're probably right
>>> that with the full use of these components that could be quickly compacted
>>> into a much smaller footprint when it comes to actual needed data flow.
>>> 
>>> I know things are still getting started here with incubation but if there
>>> are any documents/more examples I can read up on when it comes to things
>>> like Process Groups - I think that would help me see if I can fully wrap my
>>> head around this when it comes to applying this to my world.  :-)
>>> 
>>> And just let me know if there is anything I can do to help - I'm excited
>>> about the possibilities of this tool!
>>> 
>>> Thank you.
>>> 
>>> --
>>> Orrin Edenfield
>>> Associate Architect - PRGX USA, Inc.
>>> Orrin.Edenfield@prgx.com
>>> 
>>> -----Original Message-----
>>> From: Joe Witt [mailto:joe.witt@gmail.com]
>>> Sent: Friday, January 02, 2015 2:26 PM
>>> To: dev@nifi.incubator.apache.org
>>> Subject: Re: Multiple dataflows with sub-flows and version control
>>> 
>>> Orrin,
>>> 
>>> You definitely bring up a good point.  I believe though the point is about
>>> the inherent complexity that exists when you have large-scale dataflows and
>>> a large number of them at that.
>>> 
>>> What NiFi allows you to do is manage the complexity visually, in
>>> real-time, and all across the desired spectrum of granularity.  One
>>> potentially convenient way to think about it is this:
>>> 
>>> When you're writing code and you identify a new abstraction that would
>>> make things cleaner and more logical you start to refactor.  You do this to
>>> make your code more elegant, more efficient, more maintainable and to
>>> manage complexity.  In NiFi you do exactly that.  As you're growing toward
>>> hundreds or thousands of processors you identify patterns that reveal
>>> themselves visually.  That is a great way to communicate concepts not just
>>> for the original author but for others as well.  As you build flows bad
>>> ideas tend to become obvious and more importantly easy to deal with.  The
>>> key thing though is that you don't have long arduous off-line improvement
>>> cycles which tend to cause folks to avoid solving the root problem and thus
>>> they accrue tech debt.  With NiFi you just start making improvements to the
>>> flow while everything is running.  You get immediate feedback on whether
>>> what you're doing is correct or not.  You can experiment in production but
>>> outside the production flow if necessary by doing a super efficient tee of
>>> the flow.  It really is a very different way of approaching a very old
>>> problem.
>>> 
>>> It's cool that you're seeing ETL cases for it.  If there are details of
>>> that which you can share we'd love to hear them.  I don't know if the sweet
>>> spot is there or not.  We'll have to see what the community finds and how
>>> that evolves over time.  I will say for new NiFi users it is extremely
>>> common for them to think of a bunch of independent dataflow graphs which
>>> are basically a lot of independent linear graphs.  Then over time as they
>>> start to understand more about what it enables they start thinking in
>>> directed graphs and how to merge flows and establish reusable components
>>> and so on.  Curious to see how that maps to your experience.
>>> 
>>> As for the check-in of the flow configuration to a source control system
>>> you can certainly do that.  You could programmatically invoke our endpoint
>>> which causes NiFi to make a backup of the flow and then put that in source
>>> control on some time interval.  But keep in mind that is just like taking a
>>> picture of what the 'flow looks like'.  NiFi is more than the picture of
>>> the flow.  It is the picture of the flow and the state of the data within
>>> it and so on.
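
The periodic check-in described above could look roughly like the following Python sketch. It makes several assumptions: a flow backup file has already been produced somehow (the NiFi endpoint that triggers the backup is not modeled here, and the flow_backup path is hypothetical), repo_dir is an existing git working tree, and the git command is on the PATH. As the message notes, this only captures the picture of the flow, not the state of the data within it.

```python
import filecmp
import shutil
import subprocess
from pathlib import Path


def snapshot_if_changed(flow_backup: Path, repo_dir: Path) -> bool:
    """Copy the backup into the working tree; return True if it changed."""
    target = repo_dir / flow_backup.name
    if target.exists() and filecmp.cmp(flow_backup, target, shallow=False):
        return False  # identical content, nothing to check in
    shutil.copy2(flow_backup, target)
    return True


def commit_snapshot(repo_dir: Path, file_name: str, message: str) -> None:
    """Stage and commit one file with the git CLI (repo_dir must be a git repo)."""
    subprocess.run(["git", "-C", str(repo_dir), "add", file_name], check=True)
    subprocess.run(["git", "-C", str(repo_dir), "commit", "-m", message],
                   check=True)


def check_in(flow_backup: Path, repo_dir: Path) -> bool:
    """Run one check-in cycle; call this from cron or any other scheduler."""
    changed = snapshot_if_changed(flow_backup, repo_dir)
    if changed:
        commit_snapshot(repo_dir, flow_backup.name,
                        "Snapshot of flow configuration")
    return changed
```

Skipping the commit when the content is unchanged keeps the history to one entry per actual change to the flow, regardless of how often the scheduler fires.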
>>> 
>>> Very interested to hear more of your thoughts as you look further and
>>> think more about it.  You'll be a great help to us to better understand how
>>> to communicate about it to folks coming from an ETL background.  Ultimately
>>> it would be great if we get you to help us do that ;-)   Please don't
>>> be shy letting us know your expectations.  We're new here too.
>>> 
>>> Thanks
>>> Joe
>>> 
>>> 
>>> 
>>> On Fri, Jan 2, 2015 at 1:50 PM, Edenfield, Orrin <orrin.edenfield@prgx.com> wrote:
>>> 
>>>> Mark,
>>>> 
>>>> I follow the logic here; I just think over time it will be really hard
>>>> to keep track of things when there are hundreds (or thousands) of
>>>> processors - rather than hundreds of different flows (organized within
>>>> a source control tree or similar) that all have 5-50 different
>>>> processors within them.
>>>> 
>>>> I'd be interested to learn about how the Process Groups component
>>>> works so if you do get time to draw an example I think that would be
>>>> helpful.
>>>> 
>>>> Thank you.
>>>> 
>>>> --
>>>> Orrin Edenfield
>>>> 
>>>> -----Original Message-----
>>>> From: Mark Payne [mailto:markap14@hotmail.com]
>>>> Sent: Friday, January 02, 2015 12:34 PM
>>>> To: dev@nifi.incubator.apache.org; Edenfield, Orrin
>>>> Subject: Re: Multiple dataflows with sub-flows and version control
>>>> 
>>>> Orrin,
>>>> 
>>>> Within NiFi you can create many different dataflows within the same
>>>> graph and run them concurrently. We've built flows with several
>>>> hundred Processors. The data can flow between flows by simply
>>>> connecting the Processors together.
>>>> 
>>>> If you want to separate the flows logically because it makes more
>>>> sense to you to visualize them that way, you may want to use Process
>>>> Groups.
>>>> 
>>>> I'm on my cell phone right now so I cannot draw up an example for you
>>>> but I will this afternoon when I have a chance. But the basic idea is
>>>> that for
>>>> #1 you would have:
>>>> 
>>>> GetFile -> PutHDFS
>>>> 
>>>> And alongside that another GetFile -> CompressContent -> the same
>>>> PutHDFS.
>>>> 
>>>> In this case you can even do this with the following flow:
>>>> 
>>>> GetFile -> IdentifyMimeType (to check if compressed) ->
>>>> CompressContent (set to decompress, with the compression type coming from
>>>> the mime type, which is identified by the previous processor) -> PutHDFS
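
The single-flow variant above can be sketched in plain Python as follows. This is only a model of the idea, not NiFi code: the function names are hypothetical, only gzip is detected (by its two leading magic bytes), and a dictionary stands in for the HDFS destination.

```python
import gzip
from typing import Dict

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def identify_mime_type(content: bytes) -> str:
    """Guess the type from leading bytes; only gzip is recognized here."""
    if content.startswith(GZIP_MAGIC):
        return "application/gzip"
    return "application/octet-stream"


def decompress_by_mime_type(content: bytes, mime_type: str) -> bytes:
    """Decompress when the identified type says gzip; otherwise pass through."""
    if mime_type == "application/gzip":
        return gzip.decompress(content)
    return content


def run_flow(name: str, content: bytes, store: Dict[str, bytes]) -> None:
    """Identify, decompress if needed, then write to the single shared sink."""
    mime_type = identify_mime_type(content)
    store[name] = decompress_by_mime_type(content, mime_type)


# Both datasets go through the same flow and land in the same store.
store: Dict[str, bytes] = {}
run_flow("dataset1.csv", b"a,b,c\n", store)
run_flow("dataset2.csv", gzip.compress(b"x,y,z\n"), store)
```

Because the decision is driven by what the data looks like rather than where it came from, both datasets share one path and one destination.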
>>>> 
>>>> With regards to #2:
>>>> You can build the new flow right alongside the old flow. When you are
>>>> ready to switch, simply change the connection to send data to the new
>>>> flow instead of the old one.
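
A toy Python model of that switch-over, under the assumption that a connection is simply a re-pointable reference to whichever flow is current. The Connection class and both flow functions are hypothetical and only illustrate the idea; they say nothing about how NiFi implements connections.

```python
from typing import Callable, List

Flow = Callable[[str], str]


class Connection:
    """A re-pointable link between a source and whichever flow is current."""

    def __init__(self, destination: Flow) -> None:
        self.destination = destination

    def send(self, item: str) -> str:
        return self.destination(item)


def old_flow(item: str) -> str:
    return "v1:" + item


def new_flow(item: str) -> str:
    return "v2:" + item.upper()


connection = Connection(old_flow)
results: List[str] = [connection.send("a")]
# The new flow already exists alongside the old one; switching is one change.
connection.destination = new_flow
results.append(connection.send("b"))
```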
>>>> 
>>>> Again, I'll put together some examples this afternoon with screen
>>>> shots that should help. Let me know if this helps or if it creates
>>>> more questions (or both :))
>>>> 
>>>> Thanks
>>>> -Mark
>>>> 
>>>> 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Jan 2, 2015, at 11:37 AM, Edenfield, Orrin <orrin.edenfield@prgx.com> wrote:
>>>>> 
>>>>> Hello everyone - I'm new to the mailing list and I've tried to search
>>>>> the JIRA and mailing list to see if this has already been addressed
>>>>> and didn't find anything so here it goes:
>>>>> 
>>>>> When I think about the capabilities of this tool I instantly think of
>>>>> ETL-type tools. So the questions/comments below are likely to be
>>>>> coming from that frame of mind - let me know if I've misunderstood a
>>>>> key concept of NiFi as I think that could be possible.
>>>>> 
>>>>> Is it possible to have a NiFi service set up and running and allow for
>>>>> multiple dataflows to be designed and deployed (running) at the same time?
>>>>> So far in my testing I've found that I can get the NiFi service up and
>>>>> functioning as expected on my cluster edge node but I'd like to be
>>>>> able to design multiple dataflows for the following reasons.
>>>>> 
>>>>> 1. I have many datasets that will need some of the same flow actions but
>>>>> not all of them. I'd like to componentize the flows and possibly have
>>>>> multiple flows cascade from one to another. For example:  I will want
>>>>> all data to flow into an HDFS endpoint but dataset1 will be coming in
>>>>> as delimited data so it can go directly into the GetFile processor
>>>>> while I need dataset2 to go through a CompressContent processor first.
>>>>> 
>>>>> 2. Because I have a need in #1 above - I'd like to be able to design
>>>>> multiple flows (specific to a data need or component flows that work
>>>>> together) and have them all be able to be deployed (running) concurrently.
>>>>> 
>>>>> Also - it would be nice to be able to version control these designed
>>>>> flows so I can have 1 flow running while modifying a version 2.0 of
>>>>> that flow and then once the updates have been made then I can safely
>>>>> and effectively have a mechanism to shut down flow.v1 and start up flow.v2.
>>>>> 
>>>>> Thank you.
>>>>> 
>>>>> --
>>>>> Orrin Edenfield
> 
> 
> 
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
