nifi-users mailing list archives

From 彭光裕 <>
Subject RE: ‘On primary node’ strategy of GetHDFS maybe not working
Date Wed, 12 Aug 2015 08:39:49 GMT
Hi Joe,
     please ignore this “two times FlowFile” issue. I found that I had a connection
between PutHDFS and GetHDFS. That connection sent an extra FlowFile to GetHDFS’s output and the downstream processors.

After separating the flow as in the attached picture, the FlowFiles are correct, as I expected.


Now I will keep working on how to implement distributed FetchHDFS and DistributeLoad in the cluster.

By the way, I found this commit (
Maybe it will be useful for my further work.

From: 彭光裕
Sent: Wednesday, August 12, 2015 3:07 PM
Subject: RE: ‘On primary node’ strategy of GetHDFS maybe not working

Thank you Joe again.

ListHDFS(primary node) + FetchHDFS(cluster) is a good idea. I’ll remove GetHDFS and try
this suggestion later.

By the way, as you can see in the attached picture, GetHDFS shows only 1 enqueue but 2 dequeues,
so CompressContent gets 2 in and 2 out. That makes the file content appear twice, which is not what I expected.

I’m sure the property “keep source file” is false, but the duplicated pulling still happens.
However, for distributed processing, ListHDFS and FetchHDFS would still be better than
GetHDFS on the primary node alone.

Thanks again,

From: Joe Witt []
Sent: Wednesday, August 12, 2015 10:43 AM
Subject: Re: ‘On primary node’ strategy of GetHDFS maybe not working


GetHDFS pulls the file from HDFS and then deletes the original.  It is possible for race conditions
to occur, though that seems unlikely if only the primary node is doing the pull.  At this point it is likely
better to use the 'ListHDFS' processor followed by the 'FetchHDFS' processor.
 You can run ListHDFS on a single node (the primary node), send the listing
results across the cluster using site-to-site if necessary, and from there use FetchHDFS.
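The layout Joe describes — one listing on the primary node, fetches spread across the cluster — can be sketched in plain Python. This is only an illustration of the idea, not NiFi's API; the node names and the round-robin assignment stand in for what site-to-site load distribution does with the listing results:

```python
from itertools import cycle

def distribute_listing(listing, nodes):
    """Round-robin a primary-node listing across cluster nodes,
    mimicking how site-to-site can spread ListHDFS results."""
    assignments = {node: [] for node in nodes}
    node_cycle = cycle(nodes)
    for path in listing:
        assignments[next(node_cycle)].append(path)
    return assignments

# The primary node produces one listing; every node fetches only its
# share, so no file is pulled twice.
listing = ["/data/a.log", "/data/b.log", "/data/c.log", "/data/d.log"]
assignments = distribute_listing(listing, ["node1", "node2"])
```

Each node then runs FetchHDFS only against the paths it was handed, which is what removes the duplication GetHDFS-on-every-node would cause.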

All that is probably overkill though.  The first step is to figure out why you are seeing duplication.
 Is NiFi unable to delete the original file?  Please be sure that on GetHDFS "keep source file"
is false.  If it is true then NiFi will keep pulling the same file.  By using ListHDFS and
FetchHDFS, however, you can pull in an idempotent manner.  In that case you use a Distributed Cache
Service, which shares state about listings seen across the cluster.
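The idempotence Joe mentions comes from ListHDFS remembering what it has already listed in shared state. A minimal sketch of that idea, with a plain dict standing in for the Distributed Cache Service and a single timestamp watermark as a simplification of what ListHDFS actually tracks:

```python
# A plain dict stands in for the Distributed Cache Service that all
# nodes in the cluster would share.
shared_state = {"last_seen_ts": 0}

def list_new_files(files, state):
    """Return only files modified after the last listing, then advance
    the shared watermark so a re-run (or another node) skips them.

    `files` is a list of (path, modification_timestamp) pairs.
    """
    new = [(path, ts) for path, ts in files if ts > state["last_seen_ts"]]
    if new:
        state["last_seen_ts"] = max(ts for _, ts in new)
    return [path for path, _ in new]

files = [("/data/a.log", 100), ("/data/b.log", 200)]
first = list_new_files(files, shared_state)   # both files are new
second = list_new_files(files, shared_state)  # nothing new — no duplicates
```

Because the watermark lives in shared state, a second run (or a different node) sees an empty listing instead of pulling the same files again.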

Please let us know if this helps or if you would like more pointers.  This is of course a
really common use case so if we need to better document the pattern we're happy to do so.


On Tue, Aug 11, 2015 at 9:31 PM, 彭光裕 wrote:

     My flow has a GetHDFS processor. My question is that I always get many copies of the
same output files through this processor, no matter whether the scheduling strategy is ‘On primary
node’ or ‘Timer Driven’. I thought ‘On primary node’ would pull only one copy from
HDFS, but it doesn’t.
My working environment is a NiFi cluster with two worker nodes. I guess the ‘On primary node’
strategy of GetHDFS may not be working, so that all the nodes invoke GetHDFS and a race condition occurs.

Any advice is welcome, thank you!


Please be advised that this email message (including any attachments) contains confidential
information and may be legally privileged. If you are not the intended recipient, please destroy
this message and all attachments from your system and do not further collect, process, or
use them. Chunghwa Telecom and all its subsidiaries and associated companies shall not be
liable for the improper or incomplete transmission of the information contained in this email
nor for any delay in its receipt or damage to your system. If you are the intended recipient,
please protect the confidential and/or personal information contained in this email with due
care. Any unauthorized use, disclosure or distribution of this message in whole or in part
is strictly prohibited. Also, please self-inspect attachments and hyperlinks contained in
this email to ensure the information security and to protect personal information.
