metron-dev mailing list archives

From Matt Foley <mfo...@hortonworks.com>
Subject Re: [DISCUSS] Using JSON Path to support more complex documents with the JSONMap Parser
Date Thu, 25 Jan 2018 20:23:30 GMT
Oh, my understanding was that you were proposing to read the inbound json as a stream, potentially
before the generation of that json was finished, thereby saving memory (well, allowing it
to be recycled sooner and in smaller pieces), and decreasing latency.

And I thought JsonPath was capable of that.  Is it?

Of course if we’re still going to read the whole json “document” first, then the only
question of interest is how to query/extract any given piece.  Sorry if I misunderstood.
--Matt

From: Otto Fowler <ottobackwards@gmail.com>
Date: Thursday, January 25, 2018 at 12:14 PM
To: "dev@metron.apache.org" <dev@metron.apache.org>, Matt Foley <mattf@apache.org>
Subject: Re: [DISCUSS] Using JSON Path to support more complex documents with the JSONMap Parser

Sure it helps, but I am not sure I answered __your__ questions?

As I mentioned, we already use

Map<String, Object> rawMap = JSONUtils.INSTANCE.load(originalString,
    new TypeReference<Map<String, Object>>() {});

So using JSONPath, which uses the same object mapper operation under the covers, is not
a change.
We were already reading the complete document in.



On January 25, 2018 at 15:06:28, Matt Foley (mattf@apache.org) wrote:
Heh, as I said, I was looking in Python. For SAX-like JSON parsers I found numerous libraries,
most built on top of an underlying Python library named ijson, which is itself based on a
C library called yajl.

The yajl page (http://lloyd.github.io/yajl/ ) lists a double handful of language bindings
but, annoyingly, none for Java; nor does Google seem to know of any.

In Java, there's a library named json-simple in the Google Code Archive which claims a SAX-like
interface and broad production-level adoption/robustness: https://code.google.com/archive/p/json-simple/
. I don't have experience with it.

Of course, the gold standard json library for Java is Jackson. It documents stream-based parsing,
but not "SAX-like".
https://github.com/FasterXML/jackson-docs/wiki/JacksonStreamingApi indicates that using it
is equivalent to writing a parser, which suggests (disappointingly) somewhat lower-level than
SAX api.
http://www.cowtowncoder.com/blog/archives/2009/01/entry_132.html compares Jackson streaming
interface to Stax and SAX, and says it is like Stax Cursor api, claiming simpler use than
SAX (about which I have no opinion).
So I think most people use Jackson for non-streaming consumption of json.

JsonPath implementation uses Jackson under the hood, which seems good to me -- professionals
don't recreate the wheel.
And it has the charm (for this community) of a DSL-like interface. It's likely a good choice.

Hope this helps,
--Matt

On 1/25/18, 10:05 AM, "Otto Fowler" <ottobackwards@gmail.com> wrote:

In other words, I don’t believe the issue is parsing, but rather searching
and extracting.

I have used SAX with xml as well, can you point me to the json equivalent
you found?


On January 25, 2018 at 13:01:58, Otto Fowler (ottobackwards@gmail.com) wrote:

JSONPath is indeed what nifi uses. I used their implementation as a guide.
I believe starting with a path would be a good minimum viable, a good start.
We could support multiple paths of course.

Beside the fact that I knew NiFi used this approach, I believe that
JSONPath provides a flexible mechanism for defining
the targets within the document, and would make this more usable across
various document structures.

We already do full document with simple json btw.

On January 25, 2018 at 12:45:12, Matt Foley (mattf@apache.org) wrote:

Hi Otto,
Oddly, I had reason a couple of weeks ago to try to figure out a streaming
parser for very large json objects, although it was in Python rather than
Java.
Search showed two basic approaches, both unsurprisingly modeled on xml
processing:
- SAX-like parsing
- XPath-like parsing

Both are capable of a true streaming interface; that is, one doesn't have to
load the whole json into memory first.
The sound-bite comparison of the two, thanks to stackoverflow, is:

> SAX is a top-down parser and allows serial access to a XML document, and
> works well for read only [serial, streamed] access.
> XPath is useful when you only need a couple of values from the XML
> document, and you know where to find them (you know the path of the data,
> /root/item/challange/text).
> [XPath is] certainly easier to use, ... whereas ... SAX will always be a
> lot more awkward to program than XPath.

Having used SAX before, I agree it's got an "awkward" api, but it's quite
usable and does the job.
I haven't been hands-on with XPath.

Is XPath (or rather JSONPath) what NiFi uses?
And is it sufficient for our needs to have a fixed path to the message
sequence in any given json bundle?

Thanks,
--Matt


On 1/25/18, 7:57 AM, "Otto Fowler" <ottobackwards@gmail.com> wrote:

While it would be preferred if all data streamed into the parsers is
already in ‘stream’ form, as opposed to ‘batched’ form, it may not always
be possible, or possible at every step of system development.

I was wondering if it would be worth adding optional support to the JSONMap
Parser to support more complex documents, and split them in the parser into
multiple messages. This is similar in function to the JSON Splitter
processor in NiFi

So, a document would come into the JSONMap Parser from Kafka, with some
embedded set of the real message content, such as in this simplified
example:

{
  "messages" : [
    { message1 },
    { message2 },
    ….
    { messageN }
  ]
}

The JSONMap Parser would have a new configuration item for message
selection, which would be a JSON Path expression:

"messageSelector" : "$.messages"

Inside the JSONMap Parser, it would evaluate the expression and apply the
same processing to each item the expression returns.

The Parser interface already supports returning multiple message objects
from a single byte[] input.
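
[Editorial illustration] The splitting step described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the JSONMap Parser code and not a JSONPath implementation: the class name MessageSelectorSketch and the method split are hypothetical, the document is assumed to be already parsed into a Map<String, Object> (as rawMap is today), and only simple dotted selectors such as $.messages are handled, as a stand-in for real JSONPath evaluation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NOT Metron's JSONMapParser and NOT a JSONPath engine.
class MessageSelectorSketch {

    // Walks a simple dotted selector ("$.messages", "$.a.b") through an
    // already-parsed document and returns every map-typed element of the
    // selected list as a separate message. Anything else yields an empty list.
    static List<Map<String, Object>> split(Map<String, Object> document, String selector) {
        if (!selector.startsWith("$.")) {
            throw new IllegalArgumentException("Only $.field selectors are supported: " + selector);
        }
        Object current = document;
        for (String key : selector.substring(2).trim().split("\\.")) {
            if (!(current instanceof Map)) {
                return new ArrayList<>();
            }
            current = ((Map<?, ?>) current).get(key);
        }
        List<Map<String, Object>> messages = new ArrayList<>();
        if (current instanceof List) {
            for (Object item : (List<?>) current) {
                if (item instanceof Map) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> message = (Map<String, Object>) item;
                    messages.add(message);
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        // Build the equivalent of {"messages": [{"id": 1}, {"id": 2}]}.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("id", 1);
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("id", 2);
        List<Object> items = new ArrayList<>();
        items.add(first);
        items.add(second);
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("messages", items);

        // Each selected item would then go through the normal per-message processing.
        for (Map<String, Object> message : split(document, "$.messages")) {
            System.out.println(message);
        }
    }
}
```

A real implementation would presumably hand the selector to a JSONPath library rather than walk the map by hand; the sketch only shows that one inbound document fans out into a list of per-message maps.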

There is a performance penalty to be paid here, and it is more than just
the cost of handling more than one message, because of the JSONPath evaluation.

I can see this being useful in a couple of circumstances:

- You want to work with some document format with Metron but do not have
  NiFi or the equivalent available or set up yet
- You want to prototype with Metron before you get the ‘preprocessing’
  set up
- You are not going to be able to use NiFi and are ok with the performance

I have something on GitHub to look at for more detail:
ottobackwards/json-path-play
<https://github.com/ottobackwards/json-path-play>

Thoughts?

