nifi-users mailing list archives

From Joe Percivall <joeperciv...@yahoo.com>
Subject Re: queued files
Date Tue, 24 Nov 2015 16:29:34 GMT
Hello Charlie,

I was looking back through and saw this wasn't totally resolved yet. 


A couple of questions. First, what operating system are you using? There are a few
options for the command that ExecuteStreamCommand runs, depending on the platform. Also,
are you able to install new commands (using yum or brew)?

The key thing I want to solve is finding the encoding of a file based only on its contents,
without relying on access to the original file. ExecuteStreamCommand should enable this,
because you can pass any FlowFile into ExecuteStreamCommand and it will route the FlowFile
contents to STDIN for the command to operate on.

The default command on Mac (which is what I am using) for finding file encodings is
"file -bi filename.txt", but it doesn't allow you to pass in a file via STDIN. I found a
command called "uchardet" [1], which finds file encodings and allows you to pass the file
in via STDIN.
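
To make the STDIN idea concrete outside of NiFi, here is a minimal Python sketch of what
ExecuteStreamCommand is doing conceptually: hand the raw bytes to an external command on
STDIN and read its answer from STDOUT. The function name "detect_encoding" is my own
invention for illustration, and the default command assumes "uchardet" is installed and on
the PATH. Treat it as a sketch under those assumptions, not as how NiFi implements it.

```python
import subprocess


def detect_encoding(content: bytes, command=("uchardet",)) -> str:
    """Pipe raw bytes to `command` on STDIN and return its trimmed STDOUT.

    `command` defaults to uchardet (assumed to be installed); any tool that
    reads its input from STDIN can be substituted.
    """
    result = subprocess.run(
        list(command),
        input=content,           # bytes are written to the child's STDIN
        stdout=subprocess.PIPE,  # capture whatever the command prints
        stderr=subprocess.PIPE,
        check=True,              # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("ascii", errors="replace").strip()
```

Because the command only ever sees bytes on STDIN, it does not matter whether the original
file still exists on disk.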

I attached a template that takes in a file using GetFile (deletes the original) and routes
that FlowFile to ExecuteStreamCommand. ExecuteStreamCommand then runs "uchardet" on the contents
of the FlowFile and outputs the encoding to the "encoding" attribute of the original FlowFile.
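
If it helps to see the shape of that flow without opening the template, the following
Python sketch mimics it in plain code: read the file, delete the original (as GetFile is
configured to do here), run a detector on the contents only, and record the result under
an "encoding" attribute. The function "ingest_with_encoding", the dict of attributes, and
the "detect" callable are all hypothetical stand-ins for illustration; they are not NiFi
APIs.

```python
import os
from typing import Callable, Dict, Tuple


def ingest_with_encoding(
    path: str, detect: Callable[[bytes], str]
) -> Tuple[bytes, Dict[str, str]]:
    """Read a file, delete the original, and tag its content with an encoding."""
    with open(path, "rb") as handle:
        content = handle.read()
    os.remove(path)  # mirrors GetFile set to remove the source file
    attributes = {
        "filename": os.path.basename(path),
        # The detector only receives bytes; it never needs the original path.
        "encoding": detect(content),
    }
    return content, attributes
```

The detector is passed in as a function so that any STDIN-based command (uchardet in my
case) can be wrapped and swapped without changing the rest of the flow.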
 
[1] https://github.com/BYVoid/uchardet

If this doesn't satisfy your needs just let me know!
Joe

- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com




On Friday, November 20, 2015 9:53 AM, Charlie Frasure <charliefrasure@gmail.com> wrote:



I'm definitely game for that.  Let me know what I can do to help.



On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt <joe.witt@gmail.com> wrote:

>Charlie
>
>Got ya.  I missed the 'encoding vs content type' thing.  I agree let's
>find a way to avoid the extra copy.  We don't expose the storage
>location of the underlying bytes.  So on the ListFile thing.  What I
>was thinking was this (and honestly I've not tested this so maybe I'm
>skipping something important)
>
>ListFile to get a listing of names/etc.. of interest
>
>Execute the 'file --mime-encoding ${filename}' to get more attributes
>available to work with
>
>RouteOnAttribute to decide what to do with the file next.  You can
>Fetch/delete what you don't want, and you can Fetch/pass on what you do.
>
>I was looking for a way to check the mime-encoding while passing the
>data to detect into an input stream, because that is actually how
>ExecuteStreamCommand wants to work.
>
>This is a use case that should be pretty easy so if you're willing to
>chat through it with us we'll figure out a path to make it work well.
>
>Thanks
>Joe
>
>On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure
><charliefrasure@gmail.com> wrote:
>> Thanks Joe,
>>
>> The use case is that I'm receiving data without knowing what character set
>> it is coming in as.  --mime-encoding is giving its best guess on character
>> set rather than the content type.
>>
>> The ListFile sounds interesting, but I wonder if I really even need that.  I
>> don't want to leave the files in place, I just want to run an external
>> command on them as part of the data flow.  Is there a way I can run an
>> external command against the physical file such as
>> /opt/nifi/somedir/12345.uuid?  Would that info be in an attribute somewhere?
>> It just seems wasteful to make an extra copy of the file in order to run a
>> read-only command on it, then delete it.  If ListFile is still the right
>> way to go, please let me know.
>>
>>
>> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt <joe.witt@gmail.com> wrote:
>>>
>>> For identifying the mime type you may have sufficient results with the
>>> existing processor 'IdentifyMimeType' which you can put into the flow.
>>>
>>> For better logic around identifying files to pull, where you first call
>>> an external command to learn more about them, the upcoming
>>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
>>> better flexibility.
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-631
>>>
>>> Thanks
>>> Joe
>>>
>>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure
>>> <charliefrasure@gmail.com> wrote:
>>> > Thanks everyone for the help.  The trouble started a few processors
>>> > earlier in an ExecuteStreamCommand on ${filename} with the result of
>>> > "file not found".  I had originally set my GetFile processor to not
>>> > remove files, but recently changed that.  Now it seems that my
>>> > ExecuteStreamCommand may not be the best way to accomplish this.
>>> >
>>> > The command that gets executed is: file -b --mime-encoding ${filename}
>>> > in the working directory: ${absolute.path}
>>> >
>>> > Now that the file is no longer in the source directory when the
>>> > processor fires, the command is broken.  I could PutFile somewhere
>>> > temporarily; is there a better way?
>>> >
>>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt <joe.witt@gmail.com> wrote:
>>> >>
>>> >> Charlie,
>>> >>
>>> >> The fact that this is confusing is something we agree should be more
>>> >> clear and we will improve.  We're tackling it based on what is
>>> >> mentioned here [1].
>>> >>
>>> >> [1]
>>> >>
>>> >> https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management
>>> >>
>>> >> Thanks
>>> >> Joe
>>> >>
>>> >> On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers
>>> >> <cflowers@onyxpoint.com>
>>> >> wrote:
>>> >> > These guys are right. The file to look in for the uuid is the
>>> >> > nifi-app.log. Also, if you wanted to see what the processor itself
>>> >> > was doing, you could right click on the processor, get its uuid and,
>>> >> > while it is running, run (assuming it is on Linux):
>>> >> >
>>> >> > tail -F nifi-app.log | grep uuid
>>> >> >
>>> >> > This will just scroll the logs for that specific processor and will
>>> >> > show you what it is doing. It should also tell you specific file
>>> >> > names and uuids of the failing files.
>>> >> >
>>> >> > Hope that helps! Have a great night and good luck!
>>> >> >
>>> >> > Sent from my iPhone
>>> >> >
>>> >> > On Nov 19, 2015, at 9:27 PM, Juan Sequeiros <hellojuan@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> > You can also check the NiFi logs for a searchable id or for what the
>>> >> > previous processor ID produced to help search provenance.
>>> >> >
>>> >> > On Nov 19, 2015 21:22, "Bryan Bende" <bbende@gmail.com> wrote:
>>> >> >>
>>> >> >> Charlie,
>>> >> >>
>>> >> >> The behavior you described usually means that the processor
>>> >> >> encountered an unexpected error which was thrown back to the
>>> >> >> framework, which rolls back the processing of that flow file and
>>> >> >> leaves it in the queue, as opposed to an error it expected, where it
>>> >> >> would usually route to a failure relationship.
>>> >> >>
>>> >> >> Is the id that you see in the bulletin a uuid?
>>> >> >>
>>> >> >> There should still be some provenance events for this FlowFile from
>>> >> >> the previous points in the flow. If it looks like the uuid of the
>>> >> >> FlowFile, that should be searchable from provenance using the search
>>> >> >> button on the right. Let us know if we can help more.
>>> >> >>
>>> >> >> -Bryan
>>> >> >>
>>> >> >> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure
>>> >> >> <charliefrasure@gmail.com> wrote:
>>> >> >>>
>>> >> >>> I have a question on troubleshooting a flow.  I've built a flow with
>>> >> >>> no exception routing, just trying to process the expected values
>>> >> >>> first. When a file exposes a problem with the logic in my flow, it
>>> >> >>> queues up prior to the flow that is raising the bulletin.
>>> >> >>>
>>> >> >>> In the bulletin, I can see an id, but can't tell which file it is.
>>> >> >>> Data provenance doesn't seem to help as it passed the flow on the
>>> >> >>> last processor, but hasn't been logged (to my knowledge) on the next
>>> >> >>> one.
>>> >> >>>
>>> >> >>> Is there a way to match the bulletin back to a file without
>>> >> >>> creating a
>>> >> >>> route for failed files?
>>> >> >>
>>> >> >>
>>> >> >
>>> >
>>> >
>>
>>
>