nifi-dev mailing list archives

From Richard Miskin <>
Subject Re: Use of List<String> in StandardProvenanceEventRecord.Builder
Date Fri, 15 Apr 2016 17:45:18 GMT
Hi Mark,

Thanks for the pointer, I’d not spotted I was losing my provenance information. I’d changed
my code from transferring the temporary FlowFiles to an auto-terminated relationship to
using session.remove(), and had assumed that the provenance reporting was the same. I’ve
just tested it and you’re quite right: using session.remove() discards the provenance information.

Heap usage has been an issue, but it seems to be okay at present, admittedly with several GB
of heap allocated.

I did look at combining the files using one processor to load the data and then using MergeContent
to combine them. But every record loaded for a specific request must be combined into a
single file, and I couldn’t find a suitable way of guaranteeing that with MergeContent.
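For reference, one way MergeContent can guarantee that all records of one request land in a single output file is its Defragment merge strategy, which bins FlowFiles on the standard fragment attributes and only merges a bin once it is complete. A sketch of the configuration, assuming the upstream processor stamps these attributes on every record FlowFile (the placeholder values are illustrative, not from this thread):

```
# MergeContent processor (Defragment strategy)
Merge Strategy = Defragment

# Attributes the upstream processor would set on each record FlowFile:
fragment.identifier = <id shared by all records of one request>
fragment.index      = <position of this record within the request>
fragment.count      = <total number of records for the request>
```

With a complete and consistent fragment.count per identifier, MergeContent will hold fragments until all have arrived rather than merging whatever happens to be binned together.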

Thanks for your help,

> On 15 Apr 2016, at 17:10, Mark Payne <> wrote:
> Richard,
> So the order of the children may be important for some people. It certainly is reasonable
> to care about the order in which the children were created.
> The larger concern, though, would be that if we moved to a Set such as HashSet, the
> difference in the amount of heap consumed is pretty remarkable. Since this collection
> is sometimes quite large, a Set would be potentially problematic.
> That said, with the approach that you are taking, I don't think you're going to get the
> result that you are looking for, because as you remove the FlowFiles, the events generated
> for them are also removed. So you won't end up getting any Provenance events anyway.
> One possible way to achieve what you are looking for is to instead emit each of those
> individually and then use a MergeContent processor to merge the FlowFiles back together.
> Using this approach, though, you will certainly run into heap concerns if you are trying
> to merge 500,000 FlowFiles in a single iteration. Typically, the approach that we would
> follow is to merge, say, 10,000 FlowFiles at a time and then have a subsequent MergeContent
> that would merge together 50 of those 10,000-FlowFile-bundles.
> Thanks
> -Mark
>> On Apr 15, 2016, at 11:57 AM, Richard Miskin <> wrote:
>> Hi,
>> I’m trying to track down a performance problem that I’ve spotted with a custom NiFi
>> processor that I’ve written. When triggered by an incoming FlowFile, the processor
>> loads many (up to about 500,000) records from a database and produces an output file
>> in a custom format. I’m trying to leverage NiFi provenance to track what has gone into
>> the merged file, so the processor creates individual FlowFiles for each database record,
>> parented from the incoming FlowFile and with various attributes set. The output FlowFile
>> is then created as a merge of all the database record FlowFiles.
>> As I don’t require the individual database record FlowFiles outside the processor,
>> I call session.remove(Collection<FlowFile>) rather than transferring them. This works
>> fine for small numbers of records, but the call to remove gets very slow as the number
>> of FlowFiles increases, taking over a minute for 100,000 records.
>> I need to do some further testing to be sure of the cause, but looking through the code
>> I see that StandardProvenanceEventRecord.Builder contains a List<String> to hold the
>> child uuids. The call to session.remove() eventually calls down to List.remove(), which
>> will get progressively slower as the List grows.
>> Given the entries in the List<String> are uuids, could this reasonably be changed
>> to be a Set<String>? Presumably there should never be duplicates, but does the order
>> of entries matter?
>> Regards,
>> Richard
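The quadratic cost described above is easy to reproduce outside NiFi. The sketch below (plain Java, no NiFi dependencies; the class name and element count are arbitrary) removes the same set of UUID strings from an ArrayList and from a HashSet. List.remove(Object) does a linear scan plus an array shift on each call, so the loop is O(n²) overall, while HashSet.remove(Object) averages O(1) per call:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class RemoveCostSketch {
    public static void main(String[] args) {
        final int n = 20_000; // kept small so it finishes quickly; the gap widens as n grows

        List<String> uuids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            uuids.add(UUID.randomUUID().toString());
        }
        // Remove in random order so each List.remove() scans ~n/2 entries on average.
        List<String> removalOrder = new ArrayList<>(uuids);
        Collections.shuffle(removalOrder);

        List<String> list = new ArrayList<>(uuids);
        long t0 = System.nanoTime();
        for (String id : removalOrder) {
            list.remove(id); // linear scan + array shift: O(n) per call, O(n^2) overall
        }
        long listMillis = (System.nanoTime() - t0) / 1_000_000;

        Set<String> set = new HashSet<>(uuids);
        long t1 = System.nanoTime();
        for (String id : removalOrder) {
            set.remove(id); // hash lookup: O(1) on average, O(n) overall
        }
        long setMillis = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("ArrayList: " + listMillis + " ms, HashSet: " + setMillis + " ms");
    }
}
```

At 100,000 entries the ArrayList loop takes on the order of minutes, matching the slowdown reported for session.remove(), while the HashSet loop stays in the tens of milliseconds. Note that a LinkedHashSet would additionally preserve insertion order, which is relevant to the question of whether child ordering matters.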
