nifi-users mailing list archives

From Joe Witt <>
Subject Re: Best practices for handling large files
Date Wed, 07 Jun 2017 00:18:23 GMT

A great advance that occurred with the Apache NiFi 1.2.0 release is
support for record readers/writers (controller services) and a set of
processors that leverage them.  This allows for far more efficient
processing and, in many cases, completely eliminates the past need to
split down to single-event flowfiles.  Definitely worth a look.  Here
is a blog from today that highlights it a bit.  Happy to talk through
your case with you to help see how it can be done using this method.
I've got a flow running now where each box is able to run SQL queries
against record streams at a rate of several hundred events/sec with
full content archiving and provenance turned on, with live indexing.
Far more efficient than the previous approach.
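The record-oriented idea can be sketched in plain Python (this is an illustration of the principle, not the NiFi API; the `query_records` helper and the sample data are hypothetical): rather than splitting a large file into one flowfile per event, treat the file as a stream of records and apply the query in a single pass.

```python
# Illustrative sketch only -- not NiFi code. Shows why querying a
# record stream beats splitting to one-event-per-flowfile: the whole
# file is processed in one pass without materializing each event.
import csv
import io

def query_records(stream, predicate):
    """Yield matching records from a CSV stream without loading it all."""
    for record in csv.DictReader(stream):
        if predicate(record):
            yield record

# Hypothetical sample data standing in for a large CSV flowfile.
data = io.StringIO("id,level\n1,INFO\n2,ERROR\n3,ERROR\n")
errors = list(query_records(data, lambda r: r["level"] == "ERROR"))
```

This is roughly what a processor like QueryRecord does for you: the reader/writer controller services handle the format, and the records never become individual flowfiles.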


On Tue, Jun 6, 2017 at 7:23 PM, Mike Thomsen <> wrote:
> Thanks, that's actually what I ended up doing. In case anyone comes along
> looking for this, the approach we used for development was:
> GetFile -> SplitText (50k chunks) -> SplitText (1 line/flowfile) -> the rest
> On Fri, Apr 7, 2017 at 1:11 PM, Andy LoPresto <> wrote:
>> Mike,
>> Are the files a single coherent piece of information (e.g. a video file)
>> or collections of smaller atomic units of data (e.g. CSV, JSON batches)? In
>> the first case, it’s important to ensure that the processors which deal with
>> the content do so in a streaming manner so as not to exhaust your heap (and
>> ensure any custom processors you develop do the same). With the
>> latter, when splitting and merging these records, we generally propose a
>> two-step approach, where a single giant file is split into medium-size
>> flowfiles, and then each of these is split into individual records (i.e. 1 *
>> 1MM -> 10 * 100K -> 10 * 100K * 1, as opposed to 1 * 1MM -> 1MM * 1).
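The two-step split above can be sketched in plain Python (again an illustration, not NiFi code; the helper names and chunk sizes are assumptions mirroring the 1 * 1MM -> 10 * 100K -> records example):

```python
# Illustrative sketch of the two-step split -- not NiFi processors.
# Step 1 cuts a huge line-oriented input into medium chunks (like
# SplitText with 100k lines/chunk); step 2 splits each chunk into
# single records. Lazy iteration keeps heap use flat regardless of
# total file size, which is the point of the two-step approach.
from itertools import islice

def medium_chunks(lines, chunk_size=100_000):
    """First split: groups of chunk_size lines each."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def single_records(chunk):
    """Second split: one record per line."""
    yield from chunk

# Simulate a 1MM-line file without ever holding it all in memory.
lines = (f"event {i}" for i in range(1_000_000))
total = sum(1 for chunk in medium_chunks(lines)
            for _ in single_records(chunk))
```

The one-step alternative (1 * 1MM -> 1MM * 1) creates a million tiny flowfiles in a single transaction, which is what overwhelms the repositories and the heap.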
>> Other than that, be sure to follow the best practices for configuration in
>> the Admin Guide [1] and read about performance expectations [2].
>> [1]
>> [2]
>> Andy LoPresto
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> On Apr 7, 2017, at 5:26 AM, Mike Thomsen <> wrote:
>> I have one flow that will have to handle files that are anywhere from
>> 500 MB to several GB in size. The current plan is to store them in HDFS or S3
>> and then bring them down for processing in NiFi. Are there any suggestions
>> on how to handle such large single files?
>> Thanks,
>> Mike
