From Richard Miskin <>
Subject Use of List<String> in StandardProvenanceEventRecord.Builder
Date Fri, 15 Apr 2016 15:57:04 GMT

I’m trying to track down a performance problem that I’ve spotted with a custom NiFi processor
that I’ve written. When triggered by an incoming FlowFile, the processor loads many (up
to about 500,000) records from a database and produces an output file in a custom format.
I’m trying to leverage NiFi provenance to track what has gone into the merged file, so the
processor creates individual FlowFiles for each database record parented from the incoming
FlowFile and with various attributes set. The output FlowFile is then created as a merge of
all the database record FlowFiles.

As I don’t require the individual database record FlowFiles outside the processor I call
session.remove(Collection<FlowFile>) rather than transferring them. This works fine
for small numbers of records, but the call to remove gets very slow as the number of FlowFiles
increases, taking over a minute for 100,000 records.

I need to do some further testing to be sure of the cause, but looking through the code I see
that StandardProvenanceEventRecord.Builder contains a List<String> to hold the child
uuids. The call to session.remove() eventually calls down to List.remove(), which has to search
the List for each entry and so gets progressively slower as the List grows.
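
As a rough illustration of the effect I suspect, here is a standalone sketch that removes
entries one at a time from an ArrayList and from a HashSet. It is not NiFi code: the class
and method names are made up for this example, and it assumes the child uuids are held in
an ArrayList, which I have not confirmed. The timings are only indicative since this is not
a proper benchmark.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.UUID;

// Hypothetical stand-alone sketch; none of these names come from NiFi.
class ChildUuidRemovalSketch {

    // Generates the requested number of random UUID strings.
    static List<String> randomUuids(int count) {
        List<String> uuids = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            uuids.add(UUID.randomUUID().toString());
        }
        return uuids;
    }

    // Removes each id individually, as a per-FlowFile removal would,
    // and returns the elapsed time in nanoseconds.
    static long timeRemovalNanos(Collection<String> childUuids, List<String> toRemove) {
        long start = System.nanoTime();
        for (String uuid : toRemove) {
            childUuids.remove(uuid);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<String> uuids = randomUuids(50_000);

        // ArrayList.remove(Object) is linear per call, so removing n entries is O(n^2).
        long listNanos = timeRemovalNanos(new ArrayList<>(uuids), uuids);

        // HashSet.remove(Object) is constant time on average, so removing n entries is O(n).
        long setNanos = timeRemovalNanos(new HashSet<>(uuids), uuids);

        System.out.println("ArrayList removal: " + listNanos / 1_000_000 + " ms");
        System.out.println("HashSet removal:   " + setNanos / 1_000_000 + " ms");
    }
}
```

Increasing the count passed to randomUuids() should show the ArrayList time growing much
faster than the HashSet time, if my reading of the cause is right.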

Given the entries in the List<String> are uuids, could this reasonably be changed to
be a Set<String>? Presumably there should never be duplicates, but does the order of
entries matter?
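
If the order does turn out to matter, a LinkedHashSet might be a middle ground, since it
keeps insertion order while still giving constant-time removal on average. The sketch below
only shows that behaviour; the class and its methods are hypothetical and are not the
actual Builder API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for child uuids; not the real StandardProvenanceEventRecord.Builder.
class OrderedChildUuids {

    // LinkedHashSet rejects duplicates and iterates in insertion order.
    private final Set<String> childUuids = new LinkedHashSet<>();

    void addChild(String uuid) {
        childUuids.add(uuid);
    }

    void removeChild(String uuid) {
        // Hash lookup rather than a scan of the whole collection.
        childUuids.remove(uuid);
    }

    // Returns a copy as a List, for callers that still expect List<String>.
    List<String> asList() {
        return new ArrayList<>(childUuids);
    }

    public static void main(String[] args) {
        OrderedChildUuids children = new OrderedChildUuids();
        children.addChild("uuid-a");
        children.addChild("uuid-b");
        children.addChild("uuid-c");
        children.addChild("uuid-b"); // duplicate, ignored
        children.removeChild("uuid-a");
        System.out.println(children.asList()); // prints [uuid-b, uuid-c]
    }
}
```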

