nifi-users mailing list archives

From Corey Flowers <>
Subject Re: Collaboration?
Date Tue, 17 Nov 2015 13:58:25 GMT
Good morning!

We have used it for a while in multiple groups of 6-9 people each. Usually
that is what it takes to manage a few large-scale clusters. I think our
team of 9 currently supports 12 clusters, and our team of 6 supports a single
giant cluster but also quite a few clouds after distribution.
A few things we have learned over time, and the terms we use for them:

1) Processor Creep - This is where members of a team just template sections
of the graph and reuse them in another area of the graph, or build new
flows without actually incorporating them into existing flows. This is
ok, but if you do this a lot, you end up constantly adding processors
and never removing any, meaning your graph only grows in size and loses
efficiency. Sooner or later, someone has to go in and redraw the graph to
fix all the inefficiencies.

2) Sharing the graph - This isn't really a problem for us. We usually just
see who the last person to change the graph was and ask them if they are
done before adding to the graph. Once the main area of the graph is built,
adding new flows is very fast and control of the graph isn't as much of an
issue.

3) Groups hitting your provenance locally - This can be really bad
depending on how you are set up. You definitely want something like Ambari
that you post to, and you do not want every customer you support reaching
into your graph to do provenance lookups. If anything, the team directly
responsible should be the ones doing the searches for those teams; don't
give the external teams access to do the searches. Our graph slowed to a
crawl when we had about 50 groups each doing local provenance lookups at
the same time. It was a mess. I am not sure if this is still an issue, as we
haven't beaten it up to that level in the last few releases.

4) Flow.xml.gz version control - I actually don't recommend this at all. If
you want to template sections of your graph and then version control those,
then ok but in general it has not been a good practice. I am sure there is
a good way to accomplish this but so far, we haven't found a good solution.
So let's look at an example real quick:
Let's say you have a processor set and version control that, then you add
15 processors, run data, then decide you have to revert.
1) Your graphs for that section will have to be empty of data to revert or
you will lose data
2) In the time it has taken you to test those 15 processors, there were N
changes to the graph (remember, you aren't just reverting your section of
the graph; it is the whole thing). How do you keep the delta of
the changes outside of your section?
3) Stopping ingest in one section (to make sure there is no data in your
section) may cause others to back up.
4) To revert back, if you are clustered, you would need to reinsert the
flow.xml.gz into the flow.tar or replace the flow.tar, remove all the
flow.xml.gz's from the cluster nodes and restart your entire cluster.

Now if you have built your flows into groups and template those groups, you
could just stop a group, import the template, start the newly inserted
template and remove the old processors, without the flow.xml.gz version

If you are doing version control from the standpoint of having a kind of
"undo", it doesn't really work. If you are doing it to have a backup in
case you lose a server, then any central storage point should work just
fine, but if you are clustered you are covered just by how it is
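
For the backup case, here is a minimal sketch in Python of what "any
central storage point" could look like. It only decompresses a copy of
flow.xml.gz into a timestamped XML file; it does not talk to NiFi at all.
The function name backup_flow and both paths are assumptions made up for
illustration, not anything NiFi provides.

```python
import gzip
import shutil
import time
from pathlib import Path


def backup_flow(flow_gz: Path, backup_dir: Path) -> Path:
    """Decompress flow_gz into a timestamped .xml file under backup_dir.

    Hypothetical helper: the name and layout are illustrative only.
    """
    # Create the backup location (for example a mounted share) if needed.
    backup_dir.mkdir(parents=True, exist_ok=True)

    # One file per snapshot, named by local time to the second.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"flow-{stamp}.xml"

    # Stream the gzip contents out so large flows are not held in memory.
    with gzip.open(flow_gz, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

You would call it as backup_flow(Path("conf/flow.xml.gz"),
Path("/mnt/backups/nifi")), with both paths adjusted to your install, and
it could be run from cron. Keeping the uncompressed XML makes the
snapshots easy to diff, but as noted above this is a backup, not an undo.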

I am sure there are a couple of things I missed, but I am just throwing some
lessons learned out there.


On Tue, Nov 17, 2015 at 4:20 AM, Juan Jose Escobar <> wrote:

> Hello, Darren,
> We are using Nifi collaboratively but the team of people using it is not
> yet that large, so we can coordinate without problems even in production
> clusters. In some cases, when you are performing some action on the UI you
> may get "This NiFi instance has been updated by '<user>'. Please refresh to
> synchronize the view" (a warning is also shown in the toolbar even if you
> do nothing). Hitting refresh is okeish for us for now...
> You can organize your flow in different dataflows that run concurrently.
> Flows can be independent or data can flow between them just by connecting
> processors. Note that "flows" here are just a logical division, but there
> is nothing in NiFi separating them as far as I know. You can still organize
> flows using Process Groups so that everything is more structured, at least
> visually.
> For version control we keep the flow in git. You can request a backup
> using the API and move that into Git. You can have a cron job script for
> that. You may decompress the flow to have an XML format, just be strict on
> changes.
> Also, if you want to isolate flows in version control, you can save the
> specific parts as templates and put that in Git (we do that only for sub
> flows that are not part of the standard setup, kind of optional).
> Separating the templates that form the actual flow may help your team to
> work concurrently, but we are not doing that since we haven't hit
> concurrent access problems yet.
> Curious to hear about better ways to do these things...
> Regards
> On Mon, Nov 16, 2015 at 6:46 PM, darren <> wrote:
>> Hi
>> How are people using nifi collaboratively with various people designing
>> flows that may or may not be used together?
>> Also how are people doing version control of flows e.g. in git?
>> Thanks
>> D

Corey Flowers
Vice President, Onyx Point, Inc
(410) 541-6699

-- This account not approved for unencrypted proprietary information --
