All, I'm seeking some advice on best practices for dealing with FlowFiles that contain a large volume of JSON records.
My flow works like this:
Receive a FlowFile with millions of JSON records in it.
Potentially filter out some of the records based on the values of JSON fields (a custom processor uses a regex and a JSONPath expression to route records to a "matched" or "not matched" output).
Potentially split the FlowFile into multiple FlowFiles based on the value of one of the JSON fields (a custom processor evaluates a JSONPath expression and groups records into output FlowFiles by the resulting value).
Potentially split the FlowFile into uniformly sized smaller chunks so that the file size doesn't choke downstream systems (we use SplitText when the data is newline delimited; we don't currently have a way to do this when the data is a JSON array of records).
Strip out some of the JSON fields (using JoltTransformJSON).
At the end, wrap each JSON record in a proprietary format (a custom processor does the wrapping).
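To make the filter, group, and chunk steps concrete, here is a rough Python sketch of the per-record logic on newline-delimited input. It is only an illustration, not our actual processor code: the function names are made up, the field is a plain top-level key rather than a real JSONPath expression, and it collects results in lists, whereas a real processor would write each record straight to an output stream.

```python
import io
import json
import re
from collections import defaultdict
from itertools import islice


def read_ndjson(stream):
    """Yield one parsed record per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def route_by_regex(records, field, pattern):
    """Return (matched, not_matched) lists based on a regex applied to one field."""
    regex = re.compile(pattern)
    matched, not_matched = [], []
    for record in records:
        value = record.get(field)
        hit = value is not None and regex.search(str(value)) is not None
        (matched if hit else not_matched).append(record)
    return matched, not_matched


def group_by_field(records, field):
    """Group records by the value of one field (None if the field is missing)."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field)].append(record)
    return dict(groups)


def chunked(records, size):
    """Yield lists of at most `size` records, preserving order."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Because read_ndjson and chunked are generators, only one line (or one chunk) needs to be in memory at a time, which is the property that makes the newline-delimited form attractive to us.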
This flow is roughly similar across several different unrelated data sets.
The input data files are sometimes provided as a single JSON array and sometimes as newline-delimited JSON records. In general, we've found newline-delimited JSON records far easier to work with because we can process them one at a time without loading the entire FlowFile into memory (which we have to do for the array variant).
However, if we are to use JoltTransformJSON to strip out or modify some of the JSON contents, it appears to operate only on an array (which is problematic from a memory-footprint standpoint).
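What we would want instead is a per-line equivalent of a field-removal transform. A minimal Python sketch of that idea follows; it assumes every non-empty line is a JSON object and only drops top-level keys, so it is nowhere near a full Jolt replacement, and the function name is hypothetical.

```python
import io
import json


def strip_fields(in_stream, out_stream, fields_to_drop):
    """Copy newline-delimited JSON objects, omitting the given top-level keys.

    Returns the number of records written. Only one line is held in memory
    at a time.
    """
    drop = set(fields_to_drop)
    count = 0
    for line in in_stream:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        kept = {key: value for key, value in record.items() if key not in drop}
        out_stream.write(json.dumps(kept) + "\n")
        count += 1
    return count
```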
We don't really want to break our FlowFiles up into individual JSON records, as the number of FlowFiles the system would have to handle would be orders of magnitude larger than it is now.
Is our approach of moving towards newline-delimited JSON a good one? If so, is there anything that would be recommended for replacing JoltTransformJSON? Or should we build a custom processor? Or is it a reasonable feature request for the JoltTransformJSON processor to support newline-delimited JSON?
Or should we be looking into ways to do lazy loading of the JSON arrays in our custom processors (I have no clue how easy or hard this would be to do)? My little bit of googling suggests this would be difficult.
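To make the idea concrete, lazy loading here would mean something like the following Python sketch, which reads a top-level JSON array in fixed-size chunks and yields one element at a time using the standard library's json.JSONDecoder.raw_decode. It is an illustration under assumptions, not a tested design: the function name and chunk size are made up, it does very little validation, and a single very large element would be re-parsed each time more input arrives.

```python
import io
import json


def iter_json_array(stream, chunk_size=65536):
    """Yield the elements of a top-level JSON array without loading it all."""
    decoder = json.JSONDecoder()
    buf = ""          # unparsed text read so far
    eof = False       # True once the stream is exhausted
    started = False   # True once the opening '[' has been consumed
    while True:
        buf = buf.lstrip()
        if not started:
            if buf:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf = buf[1:]
                started = True
                continue
        else:
            if buf.startswith(","):
                buf = buf[1:].lstrip()
            if buf.startswith("]"):
                return
            try:
                value, end = decoder.raw_decode(buf)
            except ValueError:
                pass  # element is incomplete so far; read more below
            else:
                # Accept a value that ends exactly at the buffer edge only at
                # end of input, otherwise a number like 1234 could be cut short.
                if end < len(buf) or eof:
                    yield value
                    buf = buf[end:]
                    continue
        if eof:
            raise ValueError("truncated or malformed JSON array")
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True
```

Memory use is then bounded by roughly one element plus one chunk, rather than by the size of the whole array.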