nifi-dev mailing list archives

From Joe Skora <jsk...@gmail.com>
Subject Re: Proposal: New file processors: GetFIleData and PutFileData
Date Fri, 25 Sep 2015 03:16:06 GMT
It may be an oversimplification, but for the purposes of understanding, is
the intent to mirror a directory tree with NiFi, similar to rsync?

On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <rbraddy@softnas.com> wrote:

> Joe,
>
> Thanks for the quick response.
>
> Yes, I can add to the Wiki once access has been granted. Further responses:
>
> >> GetFile and PutFile do support recursive walking/reconstruction based
> on relative paths
>
> Based on my recent testing of 0.3.0, GetFile does walk the configured
> directory tree, picking up the files it finds; however, only files are sent
> to PutFile, which places them all into a single target folder (not a
> directory tree - no directory information is sent by GetFile nor processed
> by PutFile from what I have seen, so I do not believe it reconstructs the
> directory tree at all today).
>
> >> I do think your proposal modified to consider the design pattern of
> ListFile/FetchFile would be super powerful.
>
> We have another processor GetFileList that uses "find" to traverse a
> target folder tree and feeds the resulting newline delimited file/directory
> stream as FlowFiles into GetFileData.  Perhaps that processor could be
> evolved into a suitable ListFiles processor.
>
> I believe GetFileList/GetFileData correspond roughly to the
> ListFile/FetchFile concept, based on a cursory review of
> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously trivial
> at this point.  I'm assuming there are other facets to that List/Fetch
> design pattern - is it documented anywhere I can review to learn more?
>
> So when we have a ListFile/FetchFile what is the corresponding "Put" side
> of the flow to be?  Perhaps simply PutFile enhanced to handle FlowFiles
> from both basic GetFile and the richer FetchFile (modified GetFileData)
> types of FlowFiles and behaviors would suffice.
>
> >> Just need to make sure backpressure works through the flow so that you
> could literally handle the delivery of a file which is of itself larger
> than the repo by capturing and sending a chunk of it at a time for instance.
>
> Agreed. Are there any best practices documented for configuring
> backpressure properly?
>
> Thanks.
>
> Rick
>
> -----Original Message-----
> From: Joe Witt [mailto:joe.witt@gmail.com]
> Sent: Wednesday, September 23, 2015 6:25 PM
> To: dev@nifi.apache.org
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> Rick
>
> This is a perfectly fine place to start the thread.  If you'd like to
> create a wiki feature proposal for it too like we're doing with a lot of
> the other things at this level we can give you access to create one here
> [1].
>
> Not at all trying to take away from the points you were making but GetFile
> and PutFile do support recursive walking/reconstruction based on relative
> paths.  By no means is that as comprehensive as you're going for here
> though - just an FYI.
>
> These sound like good things.  In particular I find your concept for
> handling arbitrarily large data interesting.  Just need to make sure
> backpressure works through the flow so that you could literally handle the
> delivery of a file which is of itself larger than the repo by capturing and
> sending a chunk of it at a time for instance.  So from a brief historical
> perspective the GetFile / PutFile processors were literally the first two
> processors ever built for NiFi back when it had no GUI, no provenance, no
> nothin' that was cool.  These are the OGs of NiFi.  They've been improved a
> bit over the years but not much.
> Why?  Because their utility was largely limited to trivial archiving
> cases.  We have recently had discussions about making them more powerful
> through the concept of ListFile/FetchFile like Adam mentions and as we've
> started doing with things like HDFS.  A much better model for sure.  Still
> not as powerful as what you're cooking up though.  I do think your proposal
> modified to consider the design pattern of ListFile/FetchFile would be
> super powerful.  In your case ListFile for a single larger file for
> instance could produce N listings that point to the same file on disk but
> for different offset/ranges.  This would be *very* interesting.  I am a bit
> concerned about how to have this nicely handle competing consumer problems
> but...we can cross that bridge later.
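
One way to picture the offset/range idea: a listing step could split a single file into (offset, length) pairs, and each fetch would then read only its own range. The sketch below is in Python for brevity and is not NiFi code; `list_ranges`, `read_range`, and the `chunk_size` parameter are hypothetical names used for illustration.

```python
from typing import List, Tuple


def list_ranges(file_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that together cover file_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    offset = 0
    while offset < file_size:
        # The last range may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_range(path: str, offset: int, length: int) -> bytes:
    """Fetch one listed range from disk without reading the rest of the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Because every listing carries its own offset, the ranges can be fetched in any order and reassembled by sorting on offset. Deciding which consumer takes which range is the competing consumer question and is left open here.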
>
> If you're willing to tackle this we can definitely work with you to bring
> it in.  It is a non-trivial contribution for sure.  Folks often do not
> consider all the nasty gotchas that can occur in something as seemingly
> simple as File IO.
>
> Thanks
> Joe
>
> [1]
> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>
> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <rbraddy@softnas.com> wrote:
> > This thread proposes community review/comments of modified versions of
> GetFile and PutFile for potential future adoption by the Nifi community.
> For those who want to jump straight to the code, here's the review
> repository location for the current version:
> https://github.com/rickbraddy/nifishare.
> >
> > As background, we needed a way to replicate entire directory trees of
> files via Nifi, where multiple directory trees can be specified at run-time
> as part of an overall Nifi graph. As Nifi is rooted in file-based
> processing, it seems reasonable to continue advancing its abilities to
> ingest, process, transform and replicate files in the most flexible manner
> possible.  While this proposal is not a be all end all in that regard, it
> moves the needle in the right direction by making file-processing in Nifi
> more dynamic, enabling flows to determine how files (and directories)
should be processed, which goes well beyond today's basic file
> ingress/egress process capabilities (which certainly have their place and
> uses).  Whether it's via this proposal and code or another, clearly Nifi
> can benefit from this type of functionality.
> >
> > Here's a more detailed explanation of the rationale for developing these
> Nifi file processor derivatives and their initial implementation:
> >
> > GetFileData
> > ----------------
> > The GetFile processor monitors a single directory tree for file changes
> and creates FlowFiles for every changed file in that configured tree. It
does a good job of getting files from a configurable folder that need to be
> injected into a graph. GetFile falls short of other requirements that arise
> for general-purpose file processing:
> >
> > -          Operates from a single, pre-configured source directory (not
> dynamically configurable at run-time as part of a flow)
> >
> > -          Scheduled on a periodic basis only, not event-triggered when
> there's something to do
> >
> > -          Does not support sending an entire directory tree (only files
> are sent, not directories)
> >
> > -          Is a "source" processor node only, cannot be used within
> other Nifi flow logic that dynamically determines which files or
> directories to get and send as FlowFiles
> >
> > -          Assumes each file is smaller than the content repository,
which causes large files (hundreds of MBs, GBs, TBs) to overrun or
> dominate the content repository
> >
> > A modified version of GetFile (currently) named GetFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file ingestion with these features:
> >
> > -          Operates based upon inbound FlowFiles that contain the
> filesystem path to a file or directory
> >
> > -          Scheduled by incoming FlowFiles containing a file or
> directory path, only runs when there's something to do
> >
> > -          Supports sending a directory tree as a series of directory and
file paths; e.g., ExecuteProcess("find /mypath -print") =>
SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
GetFileData ...
> >
> > -          Participates within simple or complex flows to fetch and send
> files and directories
> >
> > -          (To be developed) Is designed to handle any size file, by
> breaking files larger than a "chunkingThreshold" into a series of multiple
> smaller files that can be reassembled on the other end (by PutFileData)
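
As a rough illustration of sending a tree as a series of directory and file paths, the following Python sketch walks a root directory and emits one record per entry, each carrying the root and the path relative to it. It stands in for the `find` plus attribute step in the example flow above; `list_tree` and the record keys are assumptions made for this sketch, not part of GetFileData.

```python
import os
from typing import Dict, Iterator


def list_tree(root_dir: str) -> Iterator[Dict[str, str]]:
    """Yield one record per directory and file under root_dir.

    Every directory record is yielded before the records for its contents.
    """
    root_dir = os.path.abspath(root_dir)
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        rel_dir = os.path.relpath(dirpath, root_dir)
        if rel_dir != ".":
            # Directories are emitted explicitly so empty ones are not lost.
            yield {"root": root_dir, "relative_path": rel_dir, "type": "directory"}
        for name in sorted(filenames):
            rel = os.path.normpath(os.path.join(rel_dir, name))
            yield {"root": root_dir, "relative_path": rel, "type": "file"}
```

Emitting directories as their own records is what lets the receiving side recreate empty directories, which a files-only listing cannot do.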
> >
> > PutFileData
> > ---------------
> > The PutFile processor accepts incoming FlowFiles and writes those files
> to a single target directory.  It does a good job of handling and resolving
> conflicts, but falls short of other requirements that arise for
> general-purpose file processing:
> >
> > -          Does not support directories, only files
> >
> > -          Only supports a single, preconfigured target directory
> >
> > -          Cannot reconstruct an entire directory tree based upon
> relative file paths (all files go into a single target directory)
> >
> > -          Assumes each file is small enough to fit into the content
> repository
> >
> > A modified version of PutFile (currently) named PutFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file egress with these features:
> >
> > -          Supports directories and files
> >
> > -          Supports reconstruction of entire directory tree based upon
relative file paths, enabling reconstruction of an entire directory tree
> originating from GetFileData
> >
> > -          (To be developed) Is designed to handle any size file, by
reassembling multi-part files into very large files (TBs) that do not fit
> within the content repository
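
Reconstruction from relative paths could look roughly like the sketch below: each incoming entry is recreated under a target root, with missing parent directories created on demand. This is plain Python rather than the PutFileData implementation; `put_entry` and its parameters are hypothetical, and the check against paths that escape the target root is an added assumption.

```python
import os


def put_entry(target_root: str, relative_path: str, is_dir: bool, data: bytes = b"") -> str:
    """Recreate one directory or file under target_root and return its path."""
    target_root = os.path.abspath(target_root)
    dest = os.path.abspath(os.path.join(target_root, relative_path))
    # Refuse paths such as "../x" that would land outside the target tree.
    if os.path.commonpath([target_root, dest]) != target_root:
        raise ValueError(f"path escapes target root: {relative_path}")
    if is_dir:
        os.makedirs(dest, exist_ok=True)
    else:
        # Create parent directories first so entries may arrive in any order.
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(data)
    return dest
```

This sketch simply overwrites an existing file; a real processor would also need a conflict resolution policy, as PutFile has today.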
> >
> > Should the community have an interest in these processors (we can name
> them something different, if needed), these contributions are now
available.  In the meantime, we shall continue developing these processors
> to meet our specific use cases, adding the chunking functionality and QA
> certifying them for production use at scale.
> >
> > Looking forward to comments, feedback and recommendations.
> >
> > Here's the Github repo link again:
> > https://github.com/rickbraddy/nifishare
> >
> > Best,
> > Rick
> >
> > P.S. If there's a better vehicle for communicating these types of
> proposals, please advise.
> >
> >
>
