nifi-users mailing list archives

From B <dpexec...@gmail.com>
Subject Re: ListSFTP failing on large directories
Date Sun, 06 May 2018 00:21:55 GMT
Yeah, I was thinking of separating things into folders, but if I can at least
take chunks of 1.5 million files, I know I can pull 1.5 million FlowFiles.

https://issues.apache.org/jira/browse/NIFI-5157

Oh exciting, I can help out too? That sounds pretty cool.

Yeah, I really wish I could customize the file age or pull only certain
folders, or maybe trigger the next listing based on the success of the ingest
further down in my flow: "oh, all the files completed, now move to the next
folder in ListSFTP at the top of the flow". Maybe "event-driven"?

On Sat, May 5, 2018 at 6:30 PM, Mark Payne <markap14@hotmail.com> wrote:

> Hello,
>
> When NiFi receives this listing from the SFTP server, it will create a
> FlowFile for each remote file. This FlowFile contains a map of attributes.
> Additionally, it will create a provenance RECEIVE event. All of this is
> then stored in an internal data structure in the session object. So, all
> told you are probably looking at about 1-2 KB of Java heap used for each
> FlowFile. That means that for 10 million FlowFiles you would need something
> on the order of 10-20 GB of heap space.
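
As a rough check of the figures above, the estimate is just the FlowFile count times the per-FlowFile heap cost. The sketch below is illustrative only: the function name `estimate_heap_gb` is made up for this note, and 1 KB is assumed to be 1,000 bytes, which is fine for an order-of-magnitude estimate.

```python
def estimate_heap_gb(flowfile_count: int, bytes_per_flowfile: int) -> float:
    """Rough heap (in GB, 10**9 bytes) needed to hold one listing's FlowFiles."""
    return flowfile_count * bytes_per_flowfile / 1e9


# 10 million FlowFiles at an assumed 1-2 KB each.
low = estimate_heap_gb(10_000_000, 1_000)
high = estimate_heap_gb(10_000_000, 2_000)
print(f"{low:.0f}-{high:.0f} GB")
```

By the same arithmetic, a listing capped at 100,000 FlowFiles would need roughly 0.1-0.2 GB.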
>
> Splitting the data directories up into smaller directories would certainly
> help. But then you would also unfortunately need N processors, I believe.
> If you use a recursive directory structure and configure the processor to
> recurse, I don’t think you’ll see an improvement.
>
> If the ListSFTP processor doesn’t already have a mechanism for batch size
> (so that you could set it to 100,000) then that would probably be a very
> useful feature to add. I think we can do this safely, as long as we emit
> the oldest data first and update the cluster’s state with each batch of
> FlowFiles.
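
The batching idea described above (emit the oldest data first, update state after each batch) could look roughly like the sketch below. This is not NiFi code and not how ListSFTP is implemented: `RemoteFile`, `list_in_batches`, `emit`, and `save_state` are hypothetical names, and the persisted state is assumed to be a single last-modified timestamp.

```python
from typing import Callable, Iterable, List, NamedTuple


class RemoteFile(NamedTuple):
    path: str
    mtime: int  # last-modified time in epoch milliseconds


def list_in_batches(
    entries: Iterable[RemoteFile],
    last_mtime: int,
    batch_size: int,
    emit: Callable[[List[RemoteFile]], None],
    save_state: Callable[[int], None],
) -> int:
    """Emit entries newer than last_mtime, oldest first, in bounded batches.

    Returns the number of batches emitted.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Keep only entries newer than the saved timestamp, oldest first.
    pending = sorted(
        (e for e in entries if e.mtime > last_mtime), key=lambda e: e.mtime
    )
    batches = 0
    start = 0
    while start < len(pending):
        end = min(start + batch_size, len(pending))
        # Extend the batch so files sharing one timestamp are never split;
        # otherwise a restart from the saved timestamp would skip the rest.
        while end < len(pending) and pending[end].mtime == pending[end - 1].mtime:
            end += 1
        batch = pending[start:end]
        emit(batch)
        # Persist progress only after the batch has been handed off.
        save_state(batch[-1].mtime)
        batches += 1
        start = end
    return batches
```

Note that this sketch still sorts the full listing in memory as lightweight tuples. What it bounds is the number of entries handed off per batch, which is where the per-FlowFile heap cost would be paid.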
>
> Do you mind creating a JIRA for that improvement? Also, if you are
> inclined to delve into implementing the feature, those of us on the mailing
> list would be more than happy to help you get it across the finish line.
>
> Thanks!
> -Mark
>
> On May 5, 2018, at 5:51 PM, B <dpexecute@gmail.com> wrote:
>
> Hi,
> I have a directory on an SFTP server into which another system has been
> sending tons of files.
>
> Now ListSFTP works GREAT on every other folder on this server. But this
> one folder has, I think, around 10 million files (maybe too many).
>
> Running "ls" on this directory takes a long time: about 5-6 minutes for
> SSH to reply.
>
> It freezes the ListSFTP processor on the primary node. I end up having to
> restart the Coordinator node after 30 minutes to an hour of waiting.
>
> Why does "ls" come back while ListSFTP struggles and freezes the thread? Is
> there a way to limit the number of files ListSFTP pulls in? I'd like to be
> able to divide the listing into chunks of 100,000.
>
> Maybe I have to run some Linux commands to split it apart into
> 100,000-file folders on the SFTP server?
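
One way to do that kind of split is sketched below. It assumes shell access to the SFTP server and that nothing is still writing to the directory. The function name `split_into_subfolders` and the `chunk_` folder prefix are made up for illustration.

```python
import os
from typing import List


def split_into_subfolders(
    directory: str, chunk_size: int = 100_000, prefix: str = "chunk_"
) -> List[str]:
    """Move the regular files in `directory` into numbered subfolders.

    Each subfolder receives at most `chunk_size` files. Returns the
    subfolder paths in the order they were filled.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Snapshot the file names first so the directory is not modified
    # while it is being scanned. Existing subdirectories are skipped.
    with os.scandir(directory) as it:
        names = sorted(e.name for e in it if e.is_file(follow_symlinks=False))
    created = []
    for start in range(0, len(names), chunk_size):
        sub = os.path.join(directory, f"{prefix}{start // chunk_size:05d}")
        os.makedirs(sub, exist_ok=True)
        for name in names[start:start + chunk_size]:
            # A rename within one filesystem moves the entry without copying data.
            os.rename(os.path.join(directory, name), os.path.join(sub, name))
        created.append(sub)
    return created
```

For a directory of around 10 million files, the name snapshot alone may take several hundred megabytes of memory, so try this on a small test directory first.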
>
> I'm using NiFi 1.3.0.
>
> Thanks,