nifi-dev mailing list archives

From Mike Drob <md...@apache.org>
Subject Re: Operational Deployment/Garbage Collection
Date Wed, 07 Jan 2015 17:39:21 GMT
On Wed, Jan 7, 2015 at 10:54 AM, Joe Witt <joe.witt@gmail.com> wrote:

> Mike,
>
> We hope to discuss this in much more detail as we make progress toward the
> administration guide.  But we are certainly susceptible to GC behaviors
> which can impact performance.  That is particularly true because of the
> extension points which folks can build to (processors, controller tasks,
> etc.).  We've taken great care to be as memory efficient as possible in
> all of our internal framework components and the existing standard
> processors.  In short, everything is designed to handle arbitrarily large
> objects without ever loading more than some finite and relatively small
> amount of memory at once.
>
Yea, capturing all of this in user/operator facing documentation is
probably the best end-goal. I can file a JIRA if one does not already exist.


> Where this breaks down as we currently have it is the FlowFile objects
> themselves.  For each flow file that is active in the flow we have the
> entire Map of attribute key/value String pairs loaded with the FlowFile
> object.  So while we do not have the actual content of the flowfile in
> memory we do have those Maps and a few small values with each.  If there
> are dozens or hundreds of large keys/values across hundreds of thousands if
> not many millions of flow files then that can start to eat into heap usage
> considerably.  We do combat this fairly well with a concept called
> 'flowfile swapping'.  If a queue backlogs beyond a configurable threshold
> we actually serialize the excess flowfiles out to storage (off heap).  This
> allows for massive backlogs to be gracefully handled.  But this mechanism
> is still arguably crude, as it is based purely on the number of flow files.
> In reality the heap cost of a flow file varies greatly, depending on the
> number and size of its attributes.
>
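
To make the "heap cost" point concrete, here is a rough sketch in Java of
estimating the footprint of one flow file's attribute map. It is not NiFi
code: the class name, the method, and the fixed per-entry overhead are all
assumptions made for illustration, and real object sizes depend on the JVM.

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeHeapEstimate {
    // Assumed overhead per entry (map node plus two String headers).
    // This is a guess for illustration, not a measured value.
    static final long ASSUMED_ENTRY_OVERHEAD_BYTES = 96;

    // Rough estimate: 2 bytes per char for each key and value,
    // plus the assumed fixed overhead per entry.
    static long estimateBytes(Map<String, String> attributes) {
        long total = 0;
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            total += ASSUMED_ENTRY_OVERHEAD_BYTES;
            total += 2L * e.getKey().length();
            total += 2L * e.getValue().length();
        }
        return total;
    }

    public static void main(String[] args) {
        // One small attribute, typical of a lightweight flow file.
        Map<String, String> small = new HashMap<>();
        small.put("filename", "a.txt");

        // Two hundred attributes with 1024-char values.
        Map<String, String> large = new HashMap<>();
        for (int i = 0; i < 200; i++) {
            large.put("attribute." + i, new String(new char[1024]));
        }

        System.out.println("small: " + estimateBytes(small) + " bytes");
        System.out.println("large: " + estimateBytes(large) + " bytes");
    }
}
```

Both maps count as one flow file against a count-based swap threshold, yet
their estimated footprints differ by several orders of magnitude.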

Are there metrics kept on flow file metadata? I recall seeing # of flow
files, but it would be cool to see summary statistics on number of
attributes, memory footprint per flow file, etc. Apologies if this already
exists, I haven't gone looking yet. Maybe JMX is a good place for these.
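
For reference, the JVM already exposes heap and collector statistics through
the standard platform MXBeans. The sketch below reads them in-process; it
covers JVM-level numbers only, not per-flow-file attribute statistics, and
the class name is just a placeholder.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapAndGcSnapshot {
    // Builds a short text report from the platform MXBeans.
    static String describe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        StringBuilder sb = new StringBuilder();
        sb.append("heap used=").append(heap.getUsed())
          .append(" committed=").append(heap.getCommitted())
          .append(" max=").append(heap.getMax())
          .append('\n');
        // One bean per collector; the names depend on which GC is in use.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(" collections=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

The same beans are visible remotely from any JMX client such as jconsole.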

>
> The key stressors of the heap:
> - Is it large enough for all the normal goings on in a flow?
> -- If yes great.  If no then no matter what things will be no fun.  The
> size needed depends on how many things are in the flow, how many flow files
> can be around at once, and the sophistication of the processors in the flow.
>
> - Are most objects created of a relatively short life span?  If yes
> great.  If not then it creates a different kind of tension on garbage
> collection.  G1 tends to handle even this fairly well but still folks
> should strive to have objects as short lived as possible.
>

> - Are all operations against content (which could be arbitrarily large)
> done in a manner which only ever has some finite amount in the heap at a
> time?  This is by far the single biggest gotcha we see related to garbage
> collection issues.  Anyone who wants their JVM to stay performant must be
> very cognizant of being buffer/stream friendly rather than using byte[] to
> hold large objects.
>
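
As a generic illustration of the stream-friendly pattern described above,
here is a minimal sketch using plain java.io rather than any NiFi API. The
class and method names are placeholders. Only one fixed-size buffer is held
on the heap, however large the content is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    // Copies everything from in to out through a single reusable buffer.
    // bufferSize must be positive. Returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for real content here; in practice the
        // source and sink would be files, sockets, or similar.
        byte[] data = new byte[100_000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out, 8192);
        System.out.println("copied " + copied + " bytes");
    }
}
```

The anti-pattern is reading the whole content into a byte[] first, which
makes heap usage grow with content size.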

I could come up with several scenarios (i.e. do this or that) to ask about,
but I think I'll be better served by just looking at existing processors as
exemplars. I'll come back with more questions after I've read the source.

>
> I've run with G1 very successfully for a very long time and if I write the
> documentation for this I would recommend its use.


Good to know.

>
> I've put together a couple of 'Stress Test' style templates that people can
> run on their configured system to get a sense of memory load for well
> behaved processors and framework components.  Hopefully that will help put
> some real information behind such a discussion.  We can also update the
> GenerateFlowFile processor to have what would be considered bad behaviors
> so folks can plainly see the effects of bad memory practices.
>

This is very cool. I would make the bad behaviours optional, but otherwise
that is an incredibly clever idea. I love it.

>
> Was this rambling even close to what you were looking for?
>

Yes, very informative. Thank you.

>
> Thanks
> Joe
>
> On Wed, Jan 7, 2015 at 11:38 AM, Mike Drob <mdrob@apache.org> wrote:
>
> > Are there operational guidelines somewhere on heap sizing and garbage
> > collection when deploying NiFi?
> >
> > There's a lot of common wisdom about how to avoid full GCs (which I assume
> > are as bad for NiFi as they are for any Java application) but I was curious
> > what people had experience running with.
> >
> > CMS? G1? C4? Are there recommended options to enable/disable based on how
> > NiFi runs for a smoother experience?
> >
> > Mike
> >
>

Mike
