nifi-dev mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Fetch change list
Date Wed, 29 Jul 2015 17:00:27 GMT
On 1) there are very few guarantees across operating systems.  Some support
locking, but many apps don't use it.  File I/O is the wild west of idioms.

On 2) you certainly can tackle it that way.  This gets into the more art
than science part of designing and composing processors.  The key is to always
keep the operations person's perspective in mind as the user.

Joe
On Jul 29, 2015 9:25 AM, "Joe Skora" <jskora@gmail.com> wrote:

> 1. Is there any reason it wouldn't work to try to open the files for write
> and only begin to handle it when it is writable?  It seems like a file
> source would typically open for write, write everything, and then close.
> Cases where something re-opens and appends would obviously not work in that
> case, but that seems a less likely situation.
>
> 2. Is there any value in breaking it into 3 phases, with a "selection"
> phase, "decision" phase, and "handling" phase?  The "selection" phase
> lists ALL possible files to be considered, the "decision" phase determines
> which files to process, and the "handling" phase manages processing the
> selected files.  Processors in the "decision" phase provide the
> "combination of signals" Adam mentions, using whatever variety of state and
> other factors are necessary.  Extending the decision logic only requires a
> new processor.  Obviously, there's still a bit of back-and-forth among the
> phases that would have to be worked out for managing file removal, etc.
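
The three-phase split described above can be sketched roughly as follows. This is only an illustration in Python, not NiFi code: the names `Candidate`, `select`, `decide`, `handle`, `not_dot_file`, and `non_empty` are hypothetical, and each "decision" rule stands in for what would be a separate processor in the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one listed file."""
    path: str
    size: int
    last_modified: float


# A decision rule accepts or rejects a single candidate.
Rule = Callable[[Candidate], bool]


def select(listing: Iterable[Candidate]) -> List[Candidate]:
    """Selection phase: every file that could possibly be considered."""
    return list(listing)


def decide(candidates: Iterable[Candidate], rules: Iterable[Rule]) -> List[Candidate]:
    """Decision phase: keep only candidates accepted by every rule."""
    rules = list(rules)
    return [c for c in candidates if all(rule(c) for rule in rules)]


def handle(chosen: Iterable[Candidate], action: Callable[[Candidate], str]) -> List[str]:
    """Handling phase: apply the processing action to each chosen file."""
    return [action(c) for c in chosen]


def not_dot_file(candidate: Candidate) -> bool:
    """Example rule: skip files following the "dot file" convention."""
    return not candidate.path.rsplit("/", 1)[-1].startswith(".")


def non_empty(candidate: Candidate) -> bool:
    """Example rule: skip zero-length files."""
    return candidate.size > 0
```

Under this sketch, extending the decision logic means adding one more rule function to the list passed to `decide`, which mirrors the "only requires a new processor" point.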
>
> Joe
>
> On Wed, Jul 29, 2015 at 10:31 AM, Joe Witt <joe.witt@gmail.com> wrote:
>
> > Turning noatime on kicks last mod out the window.  It is for sure the
> > case when dealing with file IO that there really are no rules.  As
> > Adam notes it is about giving options/strategies.
> >
> > Surprisingly hard to do this well.  But good MVP options exist to get
> > something out and get more feedback on true need.
> >
> > On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <adam@adamtaft.com> wrote:
> > > Some additional feature requests for sake of consideration...
> > >
> > > > For some file systems (I can think of one), the last modified date may
> > > > not be dependable or may not have high enough precision.  Additional
> > > > strategies could be considered for determining whether a file has been
> > > > previously processed.  For example, the byte size of the file, or the
> > > > md5 hash, or possibly other signals.
> > > >
> > > > While these additional strategies may not be coded initially, I think
> > > > they would add nice features for the proposed AbstractListFileProcessor.
> > > > In this way, the abstract processor could use one or even a combination
> > > > of signals to determine if a file has been modified and needs to be
> > > > pulled again.
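
One way to picture the "combination of signals" idea is a small fingerprint that pairs byte size with an md5 hash. The sketch below is an assumption for illustration only: `Fingerprint`, `fingerprint`, and `has_changed` are hypothetical names, and hashing every file on each listing may well be too expensive for large directories.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical pair of signals used instead of last modified date."""
    size: int
    md5: str


def fingerprint(path: str, chunk_size: int = 65536) -> Fingerprint:
    """Read the file in chunks and return its byte size and md5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return Fingerprint(os.path.getsize(path), digest.hexdigest())


def has_changed(previous: Optional[Fingerprint], current: Fingerprint) -> bool:
    """A file needs pulling again if it was never seen or any signal differs."""
    return previous is None or previous != current
```

A cheaper variant could compare sizes first and only compute the hash when the sizes match.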
> > >
> > > > Additionally, it might be good to have other mechanisms in place to
> > > > mark a file as unavailable.  The "dot file" convention is pretty common,
> > > > but there might be additional ways to indicate that a file is still
> > > > being manipulated.  That is, maybe not all writers to the file system
> > > > understand the dot file convention, so other strategies might be
> > > > required.
> > >
> > > > For example, in one processor I worked with, it was required to pull
> > > > the list of remote files twice in order to monitor the file sizes.  If
> > > > the file size stayed consistent between two pulls, it could safely be
> > > > considered ready for processing.  However, if the file size differed in
> > > > the two pulls, we could assume that a client was still writing to the
> > > > file.
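
The two-pull size check described above might look something like this minimal sketch. It assumes each listing is available as a mapping of file name to size in bytes; the function name `stable_files` is hypothetical.

```python
from typing import Dict, List


def stable_files(first: Dict[str, int], second: Dict[str, int]) -> List[str]:
    """Return, sorted, the names whose size is identical in both listings.

    A file that only appears in the second listing has no earlier size to
    compare against, so it is held back until a later pull.
    """
    return sorted(name for name, size in second.items() if first.get(name) == size)
```

For example, with `first = {"a.zip": 100, "b.zip": 50}` and `second = {"a.zip": 100, "b.zip": 75, "c.zip": 10}`, only `a.zip` would be considered ready. Note that a writer that pauses for longer than the interval between pulls would still slip through this check.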
> > >
> > > Adam
> > >
> > >
> > > > On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <markap14@hotmail.com> wrote:
> > >
> > >> Joe S,
> > >>
> > > >> I agree, I think the design of List/Fetch HDFS is extremely applicable
> > > >> to this. The way it saves state is by using a DistributedMapCacheServer.
> > > >> The intent is to run the List processor on the primary node only, and it
> > > >> will store its state there so that if the primary node is changed, any
> > > >> other node can pick up where the last one left off. In order to avoid
> > > >> saving a massive amount of state in memory, it stores the timestamp of
> > > >> the latest file that it has fetched, as well as all files that have that
> > > >> same timestamp (timestamp = last modified date in this case). So the
> > > >> next time it runs, it can pull just things whose lastModifiedDate is
> > > >> later than or equal to that timestamp, but it can still know which
> > > >> things to avoid pulling twice because we've saved that info as well.
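
The state-keeping idea in the paragraph above (latest timestamp plus the names that share it) can be sketched as below. This is an illustration of the description only, not the actual ListHDFS code; `ListingState` and `list_new` are hypothetical names, and timestamps are assumed to be integers such as epoch milliseconds.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class ListingState:
    """Latest timestamp seen, plus the names already listed at that timestamp."""
    latest_timestamp: int = -1
    names_at_latest: FrozenSet[str] = frozenset()


def list_new(entries: Iterable[Tuple[str, int]],
             state: ListingState) -> Tuple[List[str], ListingState]:
    """Return names not yet listed, and the state to save for the next run.

    entries are (name, last_modified) pairs from the current directory listing.
    """
    fresh: List[str] = []
    latest = state.latest_timestamp
    names = set(state.names_at_latest)
    # Process oldest first so the tracked timestamp only ever moves forward.
    for name, ts in sorted(entries, key=lambda e: (e[1], e[0])):
        if ts < state.latest_timestamp:
            continue  # strictly older than anything we need to look at
        if ts == state.latest_timestamp and name in state.names_at_latest:
            continue  # same timestamp, already listed on a previous run
        fresh.append(name)
        if ts > latest:
            latest, names = ts, {name}
        else:
            names.add(name)
    return fresh, ListingState(latest, frozenset(names))
```

Only one timestamp and a usually small set of names has to be stored, which is the point about avoiding a massive amount of state.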
> > >>
> > > >> Now, with ListFile it will be a bit different. We tend to think of
> > > >> GetFile and List/Fetch File as pulling from a local file system.
> > > >> However, it is also certainly used to pull from a network-mounted file
> > > >> system. In this case, all nodes in the cluster need the ability to pull
> > > >> the data in unison. So in this case, we will want to save the state in
> > > >> such a way that all nodes in the cluster have access to it, in case the
> > > >> primary node changes. But if the file is local, we don't want to save
> > > >> state across the cluster, because each node needs its own state. So
> > > >> that would likely just be an extra property on the processor.
> > >>
> > > >> If saving state locally, it's easy enough to just write to a text file
> > > >> (recommend you allow the user to specify the state file and default it
> > > >> to conf/ListFile-<processor id>.state or something like that).
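
A local state file along those lines could be as simple as the sketch below. The JSON layout, the function names, and the write-to-temp-then-rename step are all assumptions made for illustration; only the default file name pattern comes from the message above.

```python
import json
import os
import tempfile
from typing import Iterable, Set, Tuple


def default_state_path(conf_dir: str, processor_id: str) -> str:
    """Default location, following the conf/ListFile-<processor id>.state idea."""
    return os.path.join(conf_dir, f"ListFile-{processor_id}.state")


def save_state(path: str, latest_timestamp: int, names: Iterable[str]) -> None:
    """Write the state to a temp file, then rename it over the target."""
    payload = {"latest_timestamp": latest_timestamp, "names": sorted(names)}
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    # os.replace swaps the file in one step, so readers never see a partial write.
    os.replace(tmp_path, path)


def load_state(path: str) -> Tuple[int, Set[str]]:
    """Return (latest_timestamp, names); an absent file means no state yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            payload = json.load(handle)
    except FileNotFoundError:
        return -1, set()
    return payload["latest_timestamp"], set(payload["names"])
```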
> > >>
> > > >> I have not documented this pattern, specifically because we've been
> > > >> talking for a while about implementing the Simple State Management but
> > > >> we just haven't gotten there yet. I expected that we would have that
> > > >> finished before writing many more of these List/Fetch processors. That
> > > >> will radically change how we handle all of this.
> > >>
> > > >> But since it is not there... it may actually make sense to just
> > > >> refactor the ListHDFS processor into an AbstractListFileProcessor that
> > > >> is responsible for handling the state management. I am not sure how
> > > >> complicated that would get, though. Just a thought.
> > >>
> > >> Hopefully this helped to clear things up, rather than muddy them up :)
> > >> Feel free to fire back any questions.
> > >>
> > >> Thanks
> > >> -Mark
> > >>
> > >>
> > >> ----------------------------------------
> > >> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> > >> > Subject: Re: Fetch change list
> > >> > From: joe.witt@gmail.com
> > >> > To: dev@nifi.apache.org
> > >> >
> > >> > JoeS
> > >> >
> > > >> > Sounds great. I'd ignore my provenance comment as that was really
> > > >> > more about how something external could keep tabs on progress, etc.
> > > >> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to
> > > >> > him for the good bits. But the logic to follow for saving state
> > > >> > you'll want is probably the same.
> > >> >
> > > >> > Mark - do you have the design of that thing documented anywhere? It
> > > >> > is a good pattern to describe because it is effectively a model for
> > > >> > taking non-scalable dataflow interfaces and making them behave as if
> > > >> > they were.
> > >> >
> > >> > Thanks
> > >> > JoeW
> > >> >
> > > >> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <jskora@gmail.com> wrote:
> > >> >> Joe,
> > >> >>
> > > >> >> I'm interested in working on List/FetchFile. It seems like starting
> > > >> >> with [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes
> > > >> >> sense. I'll look at List/FetchHDFS, but is there any further detail
> > > >> >> on how this functionality should differ from GetFile? As for keeping
> > > >> >> state, provenance was suggested, a separate state folder might work,
> > > >> >> or some file systems support additional state that might be usable.
> > >> >>
> > >> >> Regards,
> > >> >> Joe
> > >> >>
> > > >> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <joe.witt@gmail.com> wrote:
> > >> >>
> > >> >>> Anup,
> > >> >>>
> > >> >>> The two tickets in question appear to be:
> > >> >>> https://issues.apache.org/jira/browse/NIFI-631
> > >> >>> https://issues.apache.org/jira/browse/NIFI-673
> > >> >>>
> > > >> >>> Neither has been claimed as of yet. Anybody interested in taking
> > > >> >>> one or both of these on? It would be a lot like List/Fetch HDFS so
> > > >> >>> you'll have good examples to work from.
> > >> >>>
> > >> >>> Thanks
> > >> >>> Joe
> > >> >>>
> > >> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> > >> >>> <anup.sethuram@philips.com> wrote:
> > > >> >>>> Can I expect this functionality in the upcoming releases of NiFi?
> > >> >>>>
> > > >> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <anup.sethuram@philips.com> wrote:
> > >> >>>>
> > >> >>>>>Where is this 1TB dataset living today?
> > >> >>>>>[anup] Resides in a filesystem
> > >> >>>>>
> > > >> >>>>>- What is the current nature of the dataset? Is it already in large
> > > >> >>>>>bundles as files or is it a series of tiny messages, etc.? Does it
> > > >> >>>>>need to be split/merged/etc.?
> > > >> >>>>>[anup] Archived files of size 3MB each collected over a period.
> > > >> >>>>>Directory (1TB) -> Sub-Directories -> Files
> > > >> >>>>>
> > > >> >>>>>- What is the format of the data? Is it something that can easily be
> > > >> >>>>>split/merged or will it require special processes to do so?
> > > >> >>>>>[anup] zip, tar formats.
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>--
> > >> >>>>>View this message in context:
> > >> >>>>>
> > >> >>>
> > >>
> >
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> > >> >>>>>nge-list-tp1351p2126.html
> > >> >>>>>Sent from the Apache NiFi (incubating) Developer List
mailing
> list
> > >> >>>>>archive at Nabble.com.
> > >> >>>>>
> > > >> >>>>>________________________________
> > > >> >>>>>The information contained in this message may be confidential and
> > > >> >>>>>legally protected under applicable law. The message is intended
> > > >> >>>>>solely for the addressee(s). If you are not the intended recipient,
> > > >> >>>>>you are hereby notified that any use, forwarding, dissemination, or
> > > >> >>>>>reproduction of this message is strictly prohibited and may be
> > > >> >>>>>unlawful. If you are not the intended recipient, please contact the
> > > >> >>>>>sender by return e-mail and destroy all copies of the original
> > > >> >>>>>message.
> > >> >>>>
> > >> >>>>
> > >> >>>
> > >>
> > >>
> >
>
