nifi-users mailing list archives

From Oleg Zhurakousky <ozhurakou...@hortonworks.com>
Subject Re: Merge multiple flowfiles
Date Fri, 03 Jun 2016 12:06:49 GMT
Huagen,
I also want to apologize for my spell-checker butchering your name ;)

Cheers
Oleg

On Jun 3, 2016, at 8:03 AM, Oleg Zhurakousky <ozhurakousky@hortonworks.com> wrote:

Huge

Just to close the loop on this one, I also wanted to point out this JIRA, https://issues.apache.org/jira/browse/NIFI-1926,
for a general-purpose aggregation processor, which would indeed support multiple connections
and configurable aggregation, release, and correlation strategies.
It would be nice if you could describe your use case in that JIRA, so we can start gathering
these use cases.

Cheers
Oleg

On Jun 3, 2016, at 2:33 AM, Huagen peng <huagen.peng@gmail.com> wrote:

Thanks for the reply, Andy.

I ended up abandoning my previous approach and using ExecuteStreamCommand (with the
zcat command on GZ files) to output all the files I want to concatenate.  I then perform
some data manipulation and save the file.
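
(Illustrative sketch, not part of the original message.) The zcat step described above amounts to decompressing each GZ file in order and appending the result to one output. A minimal Python equivalent might look like the following; the function name `concat_gzip_files` and the idea of running it outside NiFi are assumptions made for illustration, not something the thread describes.

```python
import gzip
import shutil


def concat_gzip_files(gz_paths, out_path):
    """Decompress each .gz file in the given order and append its bytes to out_path."""
    with open(out_path, "wb") as out:
        for path in gz_paths:
            # gzip.open in "rb" mode yields the decompressed byte stream
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

The caller controls the order of `gz_paths`, which plays the role of the argument order passed to zcat.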

Huagen

On Jun 3, 2016, at 12:29 AM, Andy LoPresto <alopresto@apache.org> wrote:

Huagen,

Sorry, I am a little confused. My understanding is that you want to combine n individual logs
(each with a respective flowfile) from a specific hour into a single file. What is confusing
is when you say “Even with that [a 5* confirmation loop], I occasionally still get more
than one merged flowfile.” Do you mean that what you expected to be combined into a single
flowfile is output as two distinct and incomplete flowfiles?

Without seeing a template of your work flow, I can make a couple of suggestions.

First, as mentioned last night by James Wing, I would encourage you to look at the MergeContent
[1] processor properties to provide a high threshold for merging flowfiles. If you know the
number of log files per hour a priori, you can set that as the “Minimum Number of Entries”
and ensure that output will wait until that many flowfiles have been accumulated.
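
(Illustrative sketch, not part of the original message.) The effect of a minimum entry count can be pictured with a toy model: flowfiles are grouped into bins by a correlation attribute, and a bin is released only once it holds the minimum number of entries. This is a simplified model, not NiFi's implementation; the real processor also applies other limits such as Max Bin Age and size thresholds. The function name, the dictionary layout, and the `hour` attribute are hypothetical.

```python
from collections import defaultdict


def bin_and_release(flowfiles, correlation_attr, min_entries):
    """Group flowfiles by an attribute; merge a bin once it has min_entries items.

    Returns (merged, pending): merged outputs, and bins still waiting for entries.
    """
    bins = defaultdict(list)
    merged = []
    for ff in flowfiles:
        key = ff["attributes"].get(correlation_attr)
        bins[key].append(ff["content"])
        if len(bins[key]) >= min_entries:
            # Bin is full: concatenate its contents and remove it from the pending set
            merged.append({"key": key, "content": b"".join(bins.pop(key))})
    return merged, dict(bins)
```

With the minimum set to the known number of logs per hour, an incomplete hour stays in `pending` instead of being emitted as a partial merge.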

Also, given that you have described a “loop”, I would imagine you may have multiple connections
feeding into MergeContent. MergeContent can have unexpected behavior with multiple incoming
connections, and so I would recommend adding a Funnel to aggregate all incoming connections
and provide a single incoming connection to MergeContent.

Please let us know if this helps, and if not, please share a template and some sample input
if possible. Thanks.

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/index.html


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Jun 1, 2016, at 11:52 AM, Huagen peng <huagen.peng@gmail.com> wrote:

Hi,

In the data flow I am dealing with now, there are multiple (up to 200) logs associated with
a given hour.  I need to process these hourly log fragments and then concatenate them into
a single file.  The approach I am using now uses an UpdateAttribute processor to set an arbitrary
segment.original.filename attribute on all the flowfiles I want to merge.  Then I use a MergeContent
processor, with an UpdateAttribute and a RouteOnAttribute processor forming a loop that confirms
five times that the merge is complete.  Even with that, I occasionally still get more than
one merged flowfile.

Is there a better way to do this?  Or should I increase the loop count, say to 10?

Thanks.

Huagen
