nifi-users mailing list archives

From 彭光裕 <rolandp...@cht.com.tw>
Subject RE: ‘On primary node’ strategy of GetHDFS maybe not working
Date Wed, 12 Aug 2015 07:06:49 GMT
Thank you Joe again.

ListHDFS(primary node) + FetchHDFS(cluster) is a good idea. I’ll remove GetHDFS and try
this suggestion later.

By the way, as you can see in the attached picture, GetHDFS shows only 1 enqueued but 2 dequeued,
so CompressContent gets 2 in and 2 out. That makes the file content twice what I expected.

[Attached image: image002.png]
I’m sure the property “keep source file” is false, but the duplicate pull still
happens.

However, for distributing the work across the cluster, ListHDFS and FetchHDFS would be better
than GetHDFS alone on the primary node.

Thanks again,

Roland.
From: Joe Witt [mailto:joe.witt@gmail.com]
Sent: Wednesday, August 12, 2015 10:43 AM
To: users@nifi.apache.org
Subject: Re: ‘On primary node’ strategy of GetHDFS maybe not working

Hello

GetHDFS pulls the file from HDFS and then deletes the original.  Race conditions are possible,
though they seem unlikely if only the primary node is doing the pull.  At this point it is
likely better to use the 'ListHDFS' processor followed by the 'FetchHDFS' processor.
You can run the ListHDFS processor on a single node (the primary node), send the listing
results across the cluster using site-to-site if necessary, and use FetchHDFS from there.
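
As a rough illustration of the list-then-fetch split described above, here is a small Python simulation. It does not use any NiFi API: the function names, node names, and file paths are all hypothetical, and the round-robin hand-off is only an assumed stand-in for however site-to-site actually distributes the listing. The point it tries to show is that when a single node produces the listing and each path is handed to exactly one node, every file is fetched once.

```python
from itertools import cycle


def list_on_primary(directory_listing):
    """Runs on one node only: return each path exactly once, in sorted order."""
    return sorted(set(directory_listing))


def distribute(paths, nodes):
    """Hand each listed path to exactly one node (round-robin, an assumption)."""
    assignment = {node: [] for node in nodes}
    for path, node in zip(paths, cycle(nodes)):
        assignment[node].append(path)
    return assignment


def fetch_on_node(node, paths):
    """Each node fetches only the paths it was handed."""
    return [(node, path) for path in paths]


# Hypothetical directory contents and cluster nodes.
listing = ["/data/a.log", "/data/b.log", "/data/c.log"]
nodes = ["node-1", "node-2"]

assignment = distribute(list_on_primary(listing), nodes)
fetched = [item for node in nodes for item in fetch_on_node(node, assignment[node])]
```

By contrast, if every node listed the directory itself, each node would see all three paths and could fetch all of them, which is one way duplicates can arise.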

All that is probably overkill though.  The first step is to figure out why you are seeing
duplication.  Is NiFi unable to delete the original file?  Please make sure "keep source file"
on GetHDFS is false.  If it is true, NiFi would keep pulling the file.  However, by using
ListHDFS and FetchHDFS you can pull in an idempotent manner.  For that case you use a
Distributed Cache Service, which shares state about listings seen across the cluster.
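
The idea of an idempotent listing backed by shared state can be sketched as follows. This is a minimal Python simulation under stated assumptions, not how ListHDFS or the Distributed Cache Service is implemented: `SharedListingState` is a hypothetical in-memory stand-in for cluster-wide state, and keying entries on (path, modification time) is an assumption made for the example.

```python
class SharedListingState:
    """Hypothetical stand-in for state shared across the cluster."""

    def __init__(self):
        self._seen = set()

    def add_if_absent(self, key):
        """Record the key and return True only the first time it is seen."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def list_new_files(directory_listing, state):
    """Emit only paths whose (path, mtime) entry has not been listed before."""
    new_paths = []
    for path, mtime in directory_listing:
        if state.add_if_absent((path, mtime)):
            new_paths.append(path)
    return new_paths


state = SharedListingState()

# First listing run: both files are new.
first = list_new_files([("/data/a.log", 100), ("/data/b.log", 105)], state)

# Second run over the same directory (possibly from a different node),
# with one file added since the first run.
second = list_new_files(
    [("/data/a.log", 100), ("/data/b.log", 105), ("/data/c.log", 110)], state
)
```

Because the state is shared rather than held by one node, a repeated listing emits only entries it has not seen before, regardless of which node runs it.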

Please let us know if this helps or if you would like more pointers.  This is of course a
really common use case so if we need to better document the pattern we're happy to do so.

Thanks
Joe

On Tue, Aug 11, 2015 at 9:31 PM, 彭光裕 <rolandpeng@cht.com.tw> wrote:
hi,

     My flow has a GetHDFS processor. My problem is that I always get many copies of the
same output files through this processor, whether the scheduling strategy is ‘On primary
node’ or ‘Timer Driven’. I thought ‘On primary node’ would get only one copy from
HDFS, but it doesn’t.
My working environment is a NiFi cluster with two worker nodes. I guess the ‘On primary node’
strategy of GetHDFS may not be working, so all the nodes invoke GetHDFS and a race condition
occurs.

Any advice is welcome, thank you!

Roland.


