drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: REST data source?
Date Wed, 01 Apr 2020 02:14:03 GMT
Hi Rafael,

Thanks much for the info. We had already implemented filter push-down for other plugins, and
for a few custom REST APIs, so should be possible to port it over to the HTTP plugin. If you
can supply code, then you can convert filters to anything you want, a specialized JSON request
body, etc. To do this generically, we have to make some assumptions, such as either 1) all
fields can be pushed as query parameters, or 2) only those in some config list. Either way,
we know how to create name=value pairs in either a GET or POST format.

You mentioned that your "payload" objects are structured. Drill can already handle this; your
query can map them to the top level:

SELECT t.characteristic.color.name AS color_name,   t.characteristic.color.confidence AS
color_confidence, ...   FROM yourTable AS t

You'll get that "out of box." Drill does assume that data is in "record format": a single
list of objects which represent records. Code would be needed to handle, say, two separate
lists of objects or other, more-general, JSON structures.


My specific question was more around the response from your web service. Does that have extra
material besides just the data records? Something like:


{ "status": "ok", "data": [ {characteristic: ... }, {...}] }

Or, is the response directly an array of objects:

 [ {characteristic: ... }, {...}]


If it is just an array, then the "out of the box" plugin will work. If there is other stuff,
then you'll need the new feature to tell Drill how to find the field to your data. The present
version needs code, but I'm thinking we can just use an array of names in the plugin config:

dataPath: [ "data" ],

Or, in your case, do you get a single record per HTTP request? If a single record, then either
your queries will be super-simple, or performance will be horrible when requesting multiple
records. (The HTTP plugin only does one request and assumes it will get back a set of records
as a JSON array or as whitespace-separated JSON objects as in a JSON file.)

Can you clarify a bit which of these cases your data follows?

I like your idea of optionally supplying a parser class for the "hard" cases:

messageParserClass: "com.mycompany.drill.MyMessageParser",

As long as the class is on the classpath, Java will find it.

Finally, on the filter push-down, the existing code we're thinking of using can handle expressions
of the form:

column op constant

Where "op" is one of the relational operators: =, !=, < etc. Also handles the obvious variations
(const op constant, column BETWEEN const1 AND const2, column IN (const1, const2, ...)).

The code cannot handle expressions (due to a limitation in Drill itself). That is, this won't
work as a filter push-down: col = 10 + 2 or col + 2 = 10. Nor can it handle multi-column expressions:
column1 = column2, etc.


I'll write up something more specific so you can see exactly what we propose.


Thanks,
- Paul

 

    On Tuesday, March 31, 2020, 6:39:57 PM PDT, Rafael Jaimes III <rafjaimes@gmail.com>
wrote:  
 
 Either a text description of the parse path or specifying the class
with the message parser could work.
I think the latter would be better, if it were simple as dropping the
JAR in 3rdparty after Drill is already built.
That way we can just continually add parsers ad-hoc.

An example JSON response includes about 4 top-level fields,
then 2 of those fields have many sub-fields.
For example a field could be nested 3 levels deep and say:

Characteristic:

  Color:

      Color name: "Red"

      Confidence: 100

  Physical:

      Size: 405

      Confidence:  95

As you can imagine, it would be difficult to flatten this because of
repeated sub-field names like "Confidence".

I don't think it would be easily exportable into a CSV.
At least for me pandas dataframe is the ultimate destination for all
of this, which also don't handle nested fields well either.
I'll have to handle some parsing on my end.

Filter pushdown would be huge and much desired.
Our other end-users are accustomed to using SQL in that manner and the
REST API we use fully support AND, OR, BETWEEN, =, <, >, etc (I can
get a full list if you're interested).
For example I think "between" is a ",". Converting the SQL statement
into the URL format would be awesome and help streamline querying
across data sources.
This is one of the main reasons why we're so interested in Drill.


Thanks,

Rafael
  
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message