nifi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Payne <marka...@hotmail.com>
Subject Re: How to optimise use of MergeContent for large number of bins.
Date Fri, 21 Sep 2018 13:52:52 GMT
Ashwin,

You should have no problem here in terms of MergeContent scaling. But there are
a few things that you'll want to consider:

1. If you have a cluster of these, you're going to be merging FlowFiles per node. So
you'll need to ensure that you have enough data on each node to reach your max
of 500 / 200. And you'll want to ensure that you have a timeout set in case you don't
fill the bin exactly.

2. The first MergeContent will need to hold (in Java heap) 500 FlowFiles per bin *
2000 bins = 1 million FlowFiles. The second MergeContent will hold up to 400,000
more FlowFiles. So you'll need a decent size heap to avoid Out of Memory Error
(probably 8 GB). If you still have issues with Java heap you may want to even consider
adding a third MergeContent that does 100 msgs/bin, then a second that does 100 msgs/bin
and a third that does 10 msgs/bin. This may not be necessary, but it's something to consider
if you want to avoid OOME and don't want to or can't allocate a huge Java heap. The good
news here is that the amount of data you're actually merging is not that much, it's just a
lot
of FlowFiles. So MergeContent should be pretty efficient.

I hope this is helpful!

Thanks
-Mark


> On Sep 21, 2018, at 8:32 AM, ashwin konale <ashwin.konale@gmail.com> wrote:
> 
> I have a nifi workflow to read from multiple mysql binlogs and put it to
> hdfs. I am using ChangeDataCapture as a source and PutHdfs as sink. I am
> using MergeContent processor in between to chunk the messages together for
> hdfs.
> 
> *[CDC (Primary node only. About 200 of this processor for each db)]
> *-> *UpdateAttributes(db-table-hour)
> -> MergeContent 500 msg(bin based on db-table-hour) ->MergeContent 200
> msg(bin based on db-table-hour) -> Put hdfs*
> 
> I have about ~200 databases to read from, And ~2500 tables altogether.
> Update rate for binlogs is around 1Mbps per database. I am planning to run
> this on 3 node nifi cluster as of now.
> 
> Has anyone used mergecontent with more than 2000 bins before? Does it scale
> well.? Can someone suggest me any improvements to the workflow or
> alternatives.
> 
> Thanks
> Ashwin


Mime
View raw message