metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] Using JSON Path to support more complex documents with the JSONMap Parser
Date Thu, 25 Jan 2018 17:45:07 GMT
Hi Otto,
Oddly, a couple of weeks ago I had reason to look for a streaming parser for very large
JSON objects, although in Python rather than Java.
A search turned up two basic approaches, both unsurprisingly modeled on XML processing:
- SAX-like parsing
- XPath-like parsing

Both are capable of a true streaming interface; that is, one doesn't have to load the whole
JSON document into memory first.
The sound-bite comparison of the two, thanks to stackoverflow, is:

> SAX is a top-down parser and allows serial access to a XML document, and works well for
> read only [serial, streamed] access.
> XPath is useful when you only need a couple of values from the XML document, and you
> know where to find them (you know the path of the data, /root/item/challange/text).
> [XPath is] certainly easier to use, ... whereas ... SAX will always be a lot more awkward
> to program than XPath.

Having used SAX before, I agree it's got an "awkward" api, but it's quite usable and does
the job.
I haven't been hands-on with XPath.
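
To make "true streaming" concrete, here is a minimal Python sketch that yields the
elements of one array without loading the whole document first. It is neither a SAX nor
a JSONPath implementation; it leans on the standard library's json.JSONDecoder.raw_decode
and simply retries as more input arrives. The function name iter_array_items, the key
"messages", and the tiny chunk size are assumptions made for illustration, and the key
lookup is naive (it takes the first textual occurrence of the quoted key).

```python
import io
import json


def iter_array_items(stream, key, chunk_size=64):
    """Yield elements of the array stored under `key`, reading `stream` in chunks.

    Hypothetical helper: it looks for the first textual occurrence of the quoted
    key, so it is only suitable for documents where that is unambiguous.
    """
    decoder = json.JSONDecoder()
    marker = json.dumps(key)  # e.g. '"messages"'
    buf = ""
    eof = False

    def fill():
        # Append one more chunk to the buffer, or note that the input is exhausted.
        nonlocal buf, eof
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True

    # Phase 1: skip ahead to the '[' that follows the key.
    while True:
        pos = buf.find(marker)
        start = buf.find("[", pos + len(marker)) if pos != -1 else -1
        if start != -1:
            buf = buf[start + 1:]
            break
        if eof:
            raise ValueError(f"no array found under {key!r}")
        fill()

    # Phase 2: decode one element at a time, reading more input only when needed.
    while True:
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            item, end = None, -1  # element is probably incomplete so far
        # Accept a value only if something follows it (or input is finished),
        # so a number cut off at a chunk boundary is never yielded early.
        if end != -1 and (end < len(buf) or eof):
            yield item
            buf = buf[end:]
            continue
        if eof:
            raise ValueError("input ended inside the array")
        fill()


doc = '{"source": "kafka", "messages": [{"id": 1}, {"id": 2, "tags": ["a", "b"]}, {"id": 3}]}'
for message in iter_array_items(io.StringIO(doc), "messages", chunk_size=7):
    print(message)
```

Only the element currently being decoded is held in the buffer, which is the property
both the SAX-like and the XPath-like streaming parsers are after.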

Is XPath (or rather JSONPath) what NiFi uses?  
And is it sufficient for our needs to have a fixed path to the message sequence in any given
json bundle?
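
As a way to picture the fixed-path case, the sketch below splits one JSON bundle into
its messages using a selector such as "$.messages". It is written in Python for brevity
and is not the Metron Parser interface or a real JSONPath engine: split_messages is a
hypothetical name, and the selector handling covers only plain dotted keys ("$.a.b"),
which is an assumption about what a fixed path would need.

```python
import json


def split_messages(raw, selector="$.messages"):
    """Return the messages found at `selector` in a JSON document given as bytes.

    Only the dotted-key subset of JSONPath is handled here ("$", "$.a", "$.a.b");
    wildcards, filters and array indexes are deliberately left out of this sketch.
    """
    selector = selector.strip()
    if not selector.startswith("$"):
        raise ValueError("selector must start with '$'")
    node = json.loads(raw.decode("utf-8"))
    # Walk down one key per path segment, skipping the empty piece after '$'.
    for key in filter(None, selector[1:].split(".")):
        node = node[key]
    # A selector that lands on a single object still yields one message.
    return node if isinstance(node, list) else [node]


bundle = b'{"messages": [{"id": 1}, {"id": 2}]}'
for message in split_messages(bundle):
    print(message)
```

Each returned item would then go through the same per-message processing as a document
that arrived already split. Note that this version loads the whole bundle into memory,
so it shows the selection step only, not streaming.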

Thanks,
--Matt


On 1/25/18, 7:57 AM, "Otto Fowler" <ottobackwards@gmail.com> wrote:

    While it would be preferred if all data streamed into the parsers is
    already in ‘stream’ form, as opposed to ‘batched’ form, it may not always
    be possible, or possible at every step of system development.
    
    I was wondering if it would be worth adding optional support to the JSONMap
    Parser to support more complex documents, and split them in the parser into
    multiple messages. This is similar in function to the JSON Splitter
    processor in NiFi.
    
    So, a document would come into the JSONMap Parser from Kafka, with some
    embedded set of the real message content, such as in this simplified
    example:
    
    {
        "messages" : [
            { message1},
            { message2},
            ….
            {messageN}
        ]
    }
    
    The JSONMap Parser would have a new configuration item for message
    selection, which would be a JSON Path expression:
    
    "messageSelector" : "$.messages"
    
    Inside the JSONMap Parser, it would evaluate the expression and apply the
    same processing to each item the expression returns.
    
    The Parser interface already supports returning multiple message objects
    from a single byte[] input.
    
    There is a performance penalty to be paid here: beyond the cost of handling
    more than one message per input, there is the JSONPath evaluation itself.
    
    I can see this being useful in a couple of circumstances:
    
       - You want to work with some document format with Metron but do not have
         NiFi or the equivalent available or set up yet
       - You want to prototype with Metron before you get the ‘preprocessing’
         set up
       - You are not going to be able to use NiFi and are ok with the performance
    
    I have something on GitHub to look at for more detail:
    ottobackwards/json-path-play
    <https://github.com/ottobackwards/json-path-play>
    
    Thoughts?
    


