nifi-users mailing list archives

From Andy LoPresto <alopre...@apache.org>
Subject Re: Reading flowfile in a stream callback
Date Fri, 03 Nov 2017 17:31:34 GMT
James,

I am not a Python expert, so I’m glad other people could weigh in. As far as routing on
content type, I agree with Joe’s sentiment that IdentifyMimeType and RouteOnAttribute are
the correct solutions there. You can route on a range of input options (the actual type, detected
charset, etc.).
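
As a rough illustration of that routing decision, here is a minimal plain-Python sketch. It is not NiFi code. The names is_text, route and TEXT_LIKE_TYPES are hypothetical, and the list of text-like types is an assumption you would adjust for your own data. It treats a mime.type value as text when it carries a charset parameter or matches a known text-like type, which is roughly the check a RouteOnAttribute rule could express.

```python
# Hypothetical helper, not part of NiFi: classify a mime.type attribute value
# the way a routing rule might.

# Assumed list of non-"text/" types that are still character data; extend as needed.
TEXT_LIKE_TYPES = {"application/json", "application/xml"}


def is_text(mime_type):
    """Return True if the mime.type value should be handled as text."""
    parts = [p.strip().lower() for p in mime_type.split(";")]
    base_type, params = parts[0], parts[1:]
    # A declared charset is a strong hint that the content is character data.
    if any(p.startswith("charset=") for p in params):
        return True
    return base_type.startswith("text/") or base_type in TEXT_LIKE_TYPES


def route(mime_type):
    """Return the name of the (hypothetical) relationship to route to."""
    return "text" if is_text(mime_type) else "binary"
```

With this sketch, route("application/json;charset=UTF-8") lands on the text branch and route("image/png") on the binary branch.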

I would definitely avoid putting code to handle multiple disparate content types (text vs.
video, etc.) in the same ExecuteScript processor. This will be harder to test, maintain, enhance,
etc. You’ll eventually reach a Switch Statement of Doom. Instead, treat each ES processor as
a black box, like a Unix tool that does one thing really well, and chain
them together. This is the philosophy NiFi is built on and you’ll have much more success
swimming with the current than fighting it.
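
To make the one-job-per-processor idea concrete, here is a plain-Python sketch, again not NiFi code. io.BytesIO objects stand in for the inputStream and outputStream a stream callback receives, and reverse_text, copy_binary and run are made-up names. In a real ExecuteScript Jython script the streams are Java streams, so the read and write calls would differ. The point is only that the text handler decodes and the binary handler never does, and each would live in its own processor.

```python
import io


def reverse_text(input_stream, output_stream, charset="utf-8"):
    """Text-only handler: decode, reverse the characters, re-encode."""
    text = input_stream.read().decode(charset)
    output_stream.write(text[::-1].encode(charset))


def copy_binary(input_stream, output_stream, chunk_size=8192):
    """Binary-only handler: pass bytes through untouched, with no decoding."""
    while True:
        chunk = input_stream.read(chunk_size)
        if not chunk:
            break
        output_stream.write(chunk)


def run(handler, payload):
    """Hypothetical harness: feed payload bytes through a single handler."""
    source, sink = io.BytesIO(payload), io.BytesIO()
    handler(source, sink)
    return sink.getvalue()
```

Upstream routing decides which handler a given flowfile reaches, so neither function needs a conditional on content type.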


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Nov 3, 2017, at 6:05 AM, Joe Witt <joe.witt@gmail.com> wrote:
> 
> Mime type detection can be difficult business but I trust Apache Tika
> to do a far better job than I ever could.  The result you show for
> JSON appears correct and I'd simply add that string to the list of
> routing attributes that I treat as text.  Or I'd key off the charset
> being provided as that would tell me enough to know it is text
> or however I wanted to treat it.
> 
> Thanks
> 
> On Fri, Nov 3, 2017 at 8:24 AM, James McMahon <jsmcmahon3@gmail.com> wrote:
>> I've always found that IdentifyMimeType returns a wide, wide range of values
>> for mime.type. It is often unclear whether mime.type is a reliable
>> indicator of the nature of the content. To illustrate, I've passed a file.txt
>> into NiFi that contains a string representation of JSON. I'd expect this to
>> be handled as textual data, but mime.type gets set to
>> application/json;charset=UTF-8.
>> 
>> Perhaps I am misusing the attribute mime.type. How have you worked around
>> this challenge, Joe?
>> 
>> On Fri, Nov 3, 2017 at 7:54 AM, Joe Witt <joe.witt@gmail.com> wrote:
>>> 
>>> "How can I discern binary or character content using conditional checks
>>> to be sure I handle the file properly?"
>>> 
>>> Use NiFi and the existing processors where able and extend/script only
>>> where necessary/critical.  For the case you mention use
>>> IdentifyMimeType and route appropriate data to the appropriate script
>>> execution.
>>> 
>>> Joe
>>> 
>>> On Fri, Nov 3, 2017 at 7:04 AM, James McMahon <jsmcmahon3@gmail.com>
>>> wrote:
>>>> Andy, regarding the code sample you offered above - doesn't this put
>>>> into text both the attributes metadata and the payload of the flowfile?
>>>> 
>>>> If that is the case, how does one modify that to read in from the stream
>>>> into variable text only the file payload?
>>>> 
>>>> On Fri, Nov 3, 2017 at 5:48 AM, James McMahon <jsmcmahon3@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Thank you Andy. I'd like to ask just a few quick follow up questions.
>>>>> 
>>>>> 1- My flow content may be textual characters, and it can also be binary:
>>>>> JPGs, PNGs, and similar. How can I discern binary or character content
>>>>> using conditional checks to be sure I handle the file properly? How
>>>>> would I alter this
>>>>> 
>>>>> text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>>>>> 
>>>>> to read in the data from the stream as binary data in that case?
>>>>> 
>>>>> 2- In the case where my data in the flowfile payload is binary, do I
>>>>> have another version of this....
>>>>> 
>>>>> outputStream.write(bytearray(reversedText.encode('utf-8')))
>>>>> 
>>>>> ....that omits the encoding, like so:
>>>>> 
>>>>> outputStream.write(bytearray(some_binary))  ?
>>>>> 
>>>>> Thank you very much in advance. -Jim
>>>>> 
>>>>> On Thu, Nov 2, 2017 at 8:26 PM, Andy LoPresto <alopresto@apache.org>
>>>>> wrote:
>>>>>> 
>>>>>> James,
>>>>>> 
>>>>>> The Python API should be the same as the Java FlowFile.java interface
>>>>>> [1]. Matt Burgess’ blog has a good post [2] about using Jython to do
>>>>>> flowfile content manipulation. Something like:
>>>>>> 
>>>>>> flowFile = session.get()
>>>>>> if flowFile is not None:
>>>>>>     # Rewrite the content via the callback, then route to success
>>>>>>     flowFile = session.write(flowFile, PyStreamCallback())
>>>>>>     session.transfer(flowFile, REL_SUCCESS)
>>>>>> 
>>>>>> With PyStreamCallback declared as a class above that block in the
>>>>>> script:
>>>>>> 
>>>>>> import java.io
>>>>>> from org.apache.commons.io import IOUtils
>>>>>> from java.nio.charset import StandardCharsets
>>>>>> from org.apache.nifi.processor.io import StreamCallback
>>>>>> 
>>>>>> class PyStreamCallback(StreamCallback):
>>>>>>     def __init__(self):
>>>>>>         pass
>>>>>>     def process(self, inputStream, outputStream):
>>>>>>         # Read the flowfile content as a UTF-8 string
>>>>>>         text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>>>>>>         reversedText = text[::-1]
>>>>>>         # Write the reversed text back as the new content
>>>>>>         outputStream.write(bytearray(reversedText.encode('utf-8')))
>>>>>> 
>>>>>> In Groovy, you can declare the StreamCallback as an inline closure to
>>>>>> make this more compact, but I believe in Jython it needs to be a
>>>>>> separate declaration. Hope this helps.
>>>>>> 
>>>>>> [1] https://github.com/apache/nifi/blob/master/nifi-api/src/main/java/org/apache/nifi/flowfile/FlowFile.java
>>>>>> [2] https://funnifi.blogspot.com/2016/03/executescript-json-to-json-revisited_14.html
>>>>>> 
>>>>>> 
>>>>>> Andy LoPresto
>>>>>> alopresto@apache.org
>>>>>> alopresto.apache@gmail.com
>>>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>>> 
>>>>>> On Nov 2, 2017, at 12:53 PM, James McMahon <jsmcmahon3@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> In Python, I can use the requests library to post content, something
>>>>>> like this:
>>>>>> 
>>>>>> import requests
>>>>>> url="https://abc.test.org"
>>>>>> files={'file':open('/somedir/myfile.txt','rb')}
>>>>>> r = requests.post(url,files=files)
>>>>>> 
>>>>>> If I am in a Python stream callback, how can I read the flowfile
>>>>>> payload in the same way that open() reads its file from disk?
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 

