nifi-users mailing list archives

From Benjamin Janssen <bjanss...@gmail.com>
Subject Large JSON File Best Practice Question
Date Fri, 10 Aug 2018 20:27:02 GMT

All,

I'm seeking some advice on best practices for dealing with FlowFiles
that contain a large volume of JSON records.

My flow works like this:

Receive a FlowFile with millions of JSON records in it.

Potentially filter out some of the records based on the value of the JSON
fields (a custom processor uses a regex and a JSON path to route records to
a "matched" or a "not matched" output path).

Potentially split the FlowFile into multiple FlowFiles based on the value
of one of the JSON fields (a custom processor uses a JSON path and groups
records into output FlowFiles based on that value).

Potentially split the FlowFile into uniformly sized smaller chunks to
prevent choking downstream systems on the file size (we use SplitText when
the data is newline delimited; we don't currently have a way to do this when
the data is a JSON array of records).

Strip out some of the JSON fields (using JoltTransformJSON).

At the end, wrap each JSON record in a proprietary format (using a custom
processor).
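
To make the filter, group, and chunk steps above concrete, here is a minimal sketch in Python. It is for illustration only and is not NiFi processor code. It assumes newline-delimited input, and it uses a simple dotted path (such as user.name) as a stand-in for a real JSON path. The names get_path, route_records, key_records, and chunk are hypothetical.

```python
import json
import re
from itertools import islice


def get_path(record, dotted_path):
    """Look up a dotted path such as 'user.name' (a stand-in for a real JSON path)."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def route_records(lines, dotted_path, pattern):
    """Yield ('matched' | 'not matched', line) for each newline-delimited record."""
    regex = re.compile(pattern)
    for line in lines:
        value = get_path(json.loads(line), dotted_path)
        hit = value is not None and regex.search(str(value)) is not None
        yield ("matched" if hit else "not matched"), line


def key_records(lines, dotted_path):
    """Yield (field value, line) so the caller can append each line to a per-value output."""
    for line in lines:
        yield get_path(json.loads(line), dotted_path), line


def chunk(lines, size):
    """Yield lists of at most `size` lines, reading the input lazily."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each function looks at one record at a time, so only the current line (or the current chunk) has to be held in memory.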

This flow is roughly similar across several different unrelated data sets.

The input data files are sometimes provided as a single JSON array and
sometimes as newline-delimited JSON records.  In general, we've found
newline-delimited JSON records far easier to work with because we can
process them one at a time without loading the entire FlowFile into memory
(which we have to do for the array variant).

However, if we are to use JoltTransformJSON to strip out or modify some of
the JSON contents, it appears to operate only on an array (which is
problematic from the memory footprint standpoint).
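
For comparison, stripping fields from newline-delimited records does not require the whole file in memory. The following is a minimal sketch in Python, for illustration only. It is not a Jolt specification and not NiFi code, it only drops top-level fields, and the name strip_fields is hypothetical.

```python
import io
import json


def strip_fields(in_stream, out_stream, fields_to_drop):
    """Copy newline-delimited JSON between streams, dropping the given top-level fields.

    Returns the number of records written. Only one record is held in memory at a time.
    """
    count = 0
    for line in in_stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for field in fields_to_drop:
            record.pop(field, None)  # ignore fields that are absent
        out_stream.write(json.dumps(record) + "\n")
        count += 1
    return count
```

A usage example, with in-memory streams standing in for the FlowFile content:

```python
source = io.StringIO('{"id": 1, "secret": "x"}\n{"id": 2}\n')
target = io.StringIO()
strip_fields(source, target, ["secret"])
```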

We don't really want to break our FlowFiles up into individual JSON records,
as the number of FlowFiles the system would have to handle would be orders
of magnitude larger than it is now.

Is our approach of moving towards newline-delimited JSON a good one?  If
so, is there anything that would be recommended for replacing
JoltTransformJSON?  Or should we build a custom processor?  Or is it a
reasonable feature request for the JoltTransformJSON processor to support
newline-delimited JSON?

Or should we be looking into ways to do lazy loading of the JSON arrays in
our custom processors (I have no clue how easy or hard this would be to
do)?  My little bit of googling suggests this would be difficult.
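
As a rough idea of what lazy loading of a JSON array could look like, here is a minimal sketch in Python using only the standard json module. It is for illustration only and is not NiFi code. It assumes the content is a single top-level array, reads it in fixed-size chunks, and yields one element at a time. The names iter_json_array and _decode_complete are hypothetical, and the sketch is lenient about stray commas between elements.

```python
import io
import json

WHITESPACE = " \t\r\n"


def _decode_complete(decoder, buffer):
    """Return (value, end) if buffer starts with a full element followed by ',' or ']'.

    Returns None when the element may be cut off at the buffer edge.
    """
    try:
        value, end = decoder.raw_decode(buffer)
    except json.JSONDecodeError:
        return None
    while end < len(buffer) and buffer[end] in WHITESPACE:
        end += 1
    if end < len(buffer) and buffer[end] in ",]":
        return value, end
    return None


def iter_json_array(stream, chunk_size=65536):
    """Yield the elements of a top-level JSON array from a text stream, one at a time."""
    decoder = json.JSONDecoder()
    buffer = ""
    while not buffer:  # skip leading whitespace, however long
        piece = stream.read(chunk_size)
        if not piece:
            raise ValueError("empty input")
        buffer = piece.lstrip(WHITESPACE)
    if not buffer.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buffer = buffer[1:]
    while True:
        buffer = buffer.lstrip(WHITESPACE + ",")
        if buffer.startswith("]"):
            return  # end of the array
        decoded = _decode_complete(decoder, buffer)
        if decoded is None:
            # The next element is incomplete (or malformed): read more and retry.
            more = stream.read(chunk_size)
            if not more:
                raise ValueError("truncated or malformed JSON array")
            buffer += more
            continue
        record, end = decoded
        yield record
        buffer = buffer[end:]  # keep only the unread tail
```

With this approach the memory needed is roughly one chunk plus the largest single element, rather than the whole array. A malformed element would cause the rest of the input to be buffered before the error is raised, so a production version would need stricter handling.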
