I am not.  I continued googling for a bit after sending my email and stumbled upon a slide deck by Bryan Bende.  My initial concern from looking at it is that it seems to require schema knowledge.

For most of our data sets, we operate in a space where we have a handful of guaranteed fields and who knows what other fields the upstream provider is going to send us.  We want to operate on the data in a manner that is non-destructive to unanticipated fields.  Is that a blocker for using the RecordReader stuff?
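
To make that concrete, here's a rough sketch of the pass-through behavior I mean (field names are made up, and it's plain Jackson rather than anything NiFi-specific):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class PassThroughSketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // "id" and "type" are guaranteed; "surprise_field" stands in for
            // whatever else the upstream provider decides to send.
            JsonNode record = mapper.readTree(
                "{\"id\":\"1\",\"type\":\"a\",\"surprise_field\":{\"x\":1}}");
            // We only inspect the fields we know about...
            String type = record.path("type").asText();
            // ...and when the record is written back out, the unanticipated
            // fields ride along untouched.
            System.out.println(type + " -> " + mapper.writeValueAsString(record));
        }
    }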

On Fri, Aug 10, 2018 at 4:30 PM Joe Witt <joe.witt@gmail.com> wrote:

Are you familiar with the record readers, writers, and associated processors?

I suspect if you make a record writer for your custom format at the end of the flow chain, you'll get great performance and control.


On Fri, Aug 10, 2018, 4:27 PM Benjamin Janssen <bjanssen1@gmail.com> wrote:
All, I'm seeking some advice on best practices for dealing with FlowFiles that contain a large volume of JSON records.

My flow works like this:

Receive a FlowFile with millions of JSON records in it.

Potentially filter out some of the records based on the value of the JSON fields.  (A custom processor uses a regex and a JSON path to produce "matched" and "not matched" output paths; there's a rough sketch of the per-record logic after this flow description.)

Potentially split the FlowFile into multiple FlowFiles based on the value of one of the JSON fields.  (A custom processor uses a JSON path and groups records into output FlowFiles by that value.)

Potentially split the FlowFile into uniformly sized smaller chunks so the file size doesn't choke downstream systems.  (We use SplitText when the data is newline-delimited, but we don't currently have a way to do this when the data is a JSON array of records.)

Strip out some of the JSON fields (using JoltTransformJSON).

At the end, wrap each JSON record in a proprietary format (a custom processor does the wrapping).
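
For the filtering step above, the per-record logic is roughly the following (a sketch only: the field name and regex are made up, Jackson's JSON Pointer stands in for the JSON path evaluation the real processor does, and stdout stands in for the two output relationships):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.BufferedReader;
    import java.io.StringReader;
    import java.util.regex.Pattern;

    public class FilterSketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            Pattern pattern = Pattern.compile("^foo.*");         // made-up regex
            String ndjson = "{\"name\":\"foobar\"}\n{\"name\":\"baz\"}\n";

            try (BufferedReader reader = new BufferedReader(new StringReader(ndjson))) {
                String line;
                while ((line = reader.readLine()) != null) {      // one record at a time
                    JsonNode record = mapper.readTree(line);
                    String value = record.at("/name").asText();   // "/name" stands in for our JSON path
                    // In the processor, matched and unmatched records are written
                    // to separate output FlowFiles instead of stdout.
                    if (pattern.matcher(value).matches()) {
                        System.out.println("matched:     " + line);
                    } else {
                        System.out.println("not matched: " + line);
                    }
                }
            }
        }
    }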

This flow is roughly similar across several different unrelated data sets.

The input data files are occasionally provided as a single JSON array and occasionally as newline-delimited JSON records.  In general, we've found newline-delimited JSON records far easier to work with because we can process them one at a time without loading the entire FlowFile into memory (which we have to do for the array variant).

However, if we are to use JoltTransformJSON to strip out or modify some of the JSON contents, it appears to operate only on an array (which is problematic from a memory footprint standpoint).
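
For reference, the per-record transform I picture a custom processor (or an NDJSON-aware JoltTransformJSON) doing would be something along these lines; this is a sketch against the standalone Jolt library (com.bazaarvoice.jolt), with a made-up remove spec and made-up records:

    import com.bazaarvoice.jolt.Chainr;
    import com.bazaarvoice.jolt.JsonUtils;

    public class PerRecordJoltSketch {
        public static void main(String[] args) {
            // Made-up spec: strip an "internal_debug" field from each record.
            String spec = "[{\"operation\": \"remove\", \"spec\": {\"internal_debug\": \"\"}}]";
            Chainr chainr = Chainr.fromSpec(JsonUtils.jsonToList(spec));

            // Each newline-delimited record is transformed on its own, so the
            // whole FlowFile never has to be held in memory as one big array.
            String[] lines = {
                "{\"id\": 1, \"internal_debug\": \"x\"}",
                "{\"id\": 2, \"internal_debug\": \"y\", \"extra\": true}"
            };
            for (String line : lines) {
                Object transformed = chainr.transform(JsonUtils.jsonToObject(line));
                System.out.println(JsonUtils.toJsonString(transformed));
            }
        }
    }

If that holds up, the same Chainr instance could presumably be reused across all the records in a FlowFile.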

We don't really want to break our FlowFiles up into individual JSON records, as the number of FlowFiles the system would have to handle would be orders of magnitude larger than it is now.

Is our approach of moving towards newline-delimited JSON a good one?  If so, is there a recommended replacement for JoltTransformJSON?  Or should we build a custom processor?  Or would it be a reasonable feature request for the JoltTransformJSON processor to support newline-delimited JSON?

Or should we be looking into ways to do lazy loading of the JSON arrays in our custom processors (I have no clue how easy or hard that would be)?  My little bit of googling suggests it would be difficult.
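
For context on what I mean by lazy loading: my rough understanding is that Jackson's streaming API can walk a top-level array one element at a time, along these lines (an untested sketch, and it may not account for complications inside a processor):

    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.StringReader;

    public class LazyArraySketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            String arrayJson = "[{\"id\": 1}, {\"id\": 2}, {\"id\": 3}]";

            try (JsonParser parser = mapper.getFactory().createParser(new StringReader(arrayJson))) {
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IllegalStateException("expected a top-level JSON array");
                }
                // Advance element by element; only one record is materialized at a time.
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    JsonNode record = mapper.readTree(parser);
                    System.out.println(record.get("id"));
                }
            }
        }
    }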