nifi-dev mailing list archives

From Joe Skora <jsk...@gmail.com>
Subject Re: [jira] [Commented] (NIFI-994) Processor to tail files
Date Wed, 30 Sep 2015 17:27:09 GMT
I think we are on the same page, but I left out some details.  The key is
that the processor always starts at the beginning when it finds a file but
discards content it thinks was previously committed downstream.

One approach could be storing a checksum of processed content with the
other state when content is committed downstream.  Files are always handled
from the start, but those that exist when the processor starts are checked
against the stored state.  If the file has the same checksum at the same
offset as the state, the content up to the offset is discarded and the file
is processed from there on.  If the checksum at the offset is different,
all the content is processed.
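
As a rough sketch of the checksum idea above (Python used for brevity; `TailState`, `prefix_checksum`, and `resume_offset` are hypothetical names and not NiFi APIs, and SHA-256 is just an assumed choice of checksum):

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TailState:
    """Hypothetical state saved whenever content is committed downstream."""
    offset: int    # number of bytes already committed
    checksum: str  # SHA-256 hex digest of those first `offset` bytes


def prefix_checksum(path, length, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of the first `length` bytes of `path`.

    Returns None if the file is shorter than `length`, which means it
    cannot be the same content that was previously committed.
    """
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return None  # file ended before reaching the stored offset
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def resume_offset(path, state):
    """Decide where to start reading a file found at processor start.

    If the file has the stored checksum at the stored offset, the content
    up to that offset is skipped. Otherwise the whole file is processed.
    """
    if state is None:
        return 0
    if prefix_checksum(path, state.offset) == state.checksum:
        return state.offset
    return 0
```

With something like this, a file that was only appended to while the processor was stopped resumes at the stored offset, while a file that was replaced or truncated is read from byte 0.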

Any content that ages off while the Processor is stopped will be lost, but
I don't see a way around that.  That said, it might be possible to
recognize some log rolling scenarios and finish processing rolled-out files
that were previously in process while the regular behavior picks up the new
file.

On Wed, Sep 30, 2015 at 11:42 AM, Joseph Percivall (JIRA) <jira@apache.org>
wrote:

>
>     [
> https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14937018#comment-14937018
> ]
>
> Joseph Percivall commented on NIFI-994:
> ---------------------------------------
>
> Adding an email chain that relates to this processor to the comments:
>
> For a NiFi processor, I think the "tail -F" makes more sense.  As opposed
> to the normal behavior that follows existing file descriptors, "tail -F"
> follows on filename (or pattern) so it tracks the current instance of a
> file, letting it handle new files during the run, log rotations, etc.
>
> I definitely agree that it should take a regex or a fixed filename.
>
> I think the biggest question is granularity.  Though tail is normally a
> line oriented operation, in NiFi it should probably be "chunk" oriented
> with each pass creating a new flow file with whatever new full lines are
> available.
>
> Joe Skora
>
> -------------
>
> Joe,
>
> The problem with "tail -F" is that if NiFi is restarted and then we do
> essentially "tail -F" we may have missed a lot of data that was written to
> the log file while NiFi was down.  The idea behind this Processor is to be
> able to recover that data, even if it was written to a log file (or any
> other sort of file) while NiFi was not running or while the Processor was
> not running.
>
> I agree that it should be "chunk oriented" - likely would need a property
> that indicates how long to tail for a single chunk. E.g., tail for 1 second
> and create a FlowFile with the content received.
>
> -Mark
>
> > Processor to tail files
> > -----------------------
> >
> >                 Key: NIFI-994
> >                 URL: https://issues.apache.org/jira/browse/NIFI-994
> >             Project: Apache NiFi
> >          Issue Type: New Feature
> >    Affects Versions: 0.4.0
> >            Reporter: Joseph Percivall
> >            Assignee: Joseph Percivall
> >
> > It's a very common data ingest situation to want to input text into the
> system by "tailing" a file, most commonly log files. Currently we don't
> have an easy way to do this.
> > A simple processor to tail a file would benefit many users. There would
> need to be an option to not just tail a file but pick up where the
> processor left off if it is interrupted.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
