drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rafael Jaimes III <rafjai...@gmail.com>
Subject Re: REST data source?
Date Wed, 01 Apr 2020 02:40:28 GMT
Yes your initial assessment was correct, there is extra material other than
the data field.
The returned JSON has some top-level fields that don't go any deeper, akin
to your "status" : ok field. In the example I'm running now, one is called
MessageState which is set to "NEW". There's another field called
MessageData, which, obviously, holds most of the data. There are some other
top-level fields, and one is called MessageHeader which is nested. There's
a lot of stuff here, and this is just one "table" I'm querying against now.
Not sure how it will differ with the other services.

The service is definitely returning multiple records - I believe it's a
JSON array and Drill+HTTP/plugin appears to handle it quite well.

You're right, Drill is handling most of the structure by modifying my
SELECT statement as you suggested.

For filter pushdown, expressions of that form would be great. That's what I
had in mind too.

Thanks,
Rafael

On Tue, Mar 31, 2020 at 10:14 PM Paul Rogers <par0328@yahoo.com.invalid>
wrote:

> Hi Rafael,
>
> Thanks much for the info. We had already implemented filter push-down for
> other plugins, and for a few custom REST APIs, so should be possible to
> port it over to the HTTP plugin. If you can supply code, then you can
> convert filters to anything you want, a specialized JSON request body, etc.
> To do this generically, we have to make some assumptions, such as either 1)
> all fields can be pushed as query parameters, or 2) only those in some
> config list. Either way, we know how to create name=value pairs in either a
> GET or POST format.
>
> You mentioned that your "payload" objects are structured. Drill can
> already handle this; your query can map them to the top level:
>
> SELECT t.characteristic.color.name AS color_name,
> t.characteristic.color.confidence AS color_confidence, ...   FROM yourTable
> AS t
>
> You'll get that "out of box." Drill does assume that data is in "record
> format": a single list of objects which represent records. Code would be
> needed to handle, say, two separate lists of objects or other,
> more-general, JSON structures.
>
>
> My specific question was more around the response from your web service.
> Does that have extra material besides just the data records? Something like:
>
>
> { "status": "ok", "data": [ {characteristic: ... }, {...}] }
>
> Or, is the response directly an array of objects:
>
>  [ {characteristic: ... }, {...}]
>
>
> If it is just an array, then the "out of the box" plugin will work. If
> there is other stuff, then you'll need the new feature to tell Drill how to
> find the field to your data. The present version needs code, but I'm
> thinking we can just use an array of names in the plugin config:
>
> dataPath: [ "data" ],
>
> Or, in your case, do you get a single record per HTTP request? If a single
> record, then either your queries will be super-simple, or performance will
> be horrible when requesting multiple records. (The HTTP plugin only does
> one request and assumes it will get back a set of records as a JSON array
> or as whitespace-separated JSON objects as in a JSON file.)
>
> Can you clarify a bit which of these cases your data follows?
>
> I like your idea of optionally supplying a parser class for the "hard"
> cases:
>
> messageParserClass: "com.mycompany.drill.MyMessageParser",
>
> As long as the class is on the classpath, Java will find it.
>
> Finally, on the filter push-down, the existing code we're thinking of
> using can handle expressions of the form:
>
> column op constant
>
> Where "op" is one of the relational operators: =, !=, < etc. Also handles
> the obvious variations (const op constant, column BETWEEN const1 AND
> const2, column IN (const1, const2, ...)).
>
> The code cannot handle expressions (due to a limitation in Drill itself).
> That is, this won't work as a filter push-down: col = 10 + 2 or col + 2 =
> 10. Nor can it handle multi-column expressions: column1 = column2, etc.
>
>
> I'll write up something more specific so you can see exactly what we
> propose.
>
>
> Thanks,
> - Paul
>
>
>
>     On Tuesday, March 31, 2020, 6:39:57 PM PDT, Rafael Jaimes III <
> rafjaimes@gmail.com> wrote:
>
>  Either a text description of the parse path or specifying the class
> with the message parser could work.
> I think the latter would be better, if it were simple as dropping the
> JAR in 3rdparty after Drill is already built.
> That way we can just continually add parsers ad-hoc.
>
> An example JSON response includes about 4 top-level fields,
> then 2 of those fields have many sub-fields.
> For example a field could be nested 3 levels deep and say:
>
> Characteristic:
>
>   Color:
>
>       Color name: "Red"
>
>       Confidence: 100
>
>   Physical:
>
>       Size: 405
>
>       Confidence:  95
>
> As you can imagine, it would be difficult to flatten this because of
> repeated sub-field names like "Confidence".
>
> I don't think it would be easily exportable into a CSV.
> At least for me pandas dataframe is the ultimate destination for all
> of this, which also don't handle nested fields well either.
> I'll have to handle some parsing on my end.
>
> Filter pushdown would be huge and much desired.
> Our other end-users are accustomed to using SQL in that manner and the
> REST API we use fully support AND, OR, BETWEEN, =, <, >, etc (I can
> get a full list if you're interested).
> For example I think "between" is a ",". Converting the SQL statement
> into the URL format would be awesome and help streamline querying
> across data sources.
> This is one of the main reasons why we're so interested in Drill.
>
>
> Thanks,
>
> Rafael
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message