drill-user mailing list archives

From "Jaimes, Rafael - 0993 - MITLL" <Rafael.Jai...@ll.mit.edu>
Subject RE: REST data source?
Date Wed, 01 Apr 2020 14:22:29 GMT
Hi all,

I built Charles' latest branch including the proxy setup. It appears to be 
working quite well going through the proxy.

I'll continue to test and report back if I find any issues.

Note: Beyond Paul's repo recommendations, I had to skip checkstyle to get the
Maven build to complete. You're probably already aware of that; I think it's
just specific to this branch.


-----Original Message-----
From: Paul Rogers <par0328@yahoo.com.INVALID>
Sent: Wednesday, April 1, 2020 1:29 AM
To: user <user@drill.apache.org>
Subject: Re: REST data source?

Thanks, Charles.

As Charles suggested, I pushed a commit that replaces the "old" JSON reader 
with the new EVF-based one. Eventually this will allow us to use a "provided 
schema" to handle any JSON ambiguities.

As we've been discussing, I'll try to add the ability to specify a path to 
data: "response/payload/records" or whatever. With the present commit, that 
path can be parsed in code, but I think a simple path spec would be easier.
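
A minimal sketch of what such a path spec could do, written in Python purely for illustration (the plugin itself is Java). The function name `records_at`, the slash-separated syntax, and the sample response are assumptions, not the plugin's actual API:

```python
import json


def records_at(document, path):
    """Follow a slash-separated path into parsed JSON and return the record list.

    Hypothetical helper: the path syntax mirrors the "response/payload/records"
    example from the thread, not a confirmed plugin option.
    """
    node = document
    # Ignore empty segments so "a/b/" and "/a/b" behave like "a/b".
    for key in (k for k in path.split("/") if k):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"path element {key!r} not found")
        node = node[key]
    if not isinstance(node, list):
        raise TypeError("path does not lead to a list of records")
    return node


# Made-up response with extra material around the data records.
response = json.loads(
    '{"response": {"status": "ok",'
    ' "payload": {"records": [{"id": 1}, {"id": 2}]}}}'
)
records = records_at(response, "response/payload/records")
```

Everything outside the named path (here, "status") is simply skipped, which is the behavior being discussed for responses that wrap the records in other fields.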

- Paul

    On Tuesday, March 31, 2020, 10:00:52 PM PDT, Charles Givre 
<cgivre@gmail.com> wrote:

 Hello all,
I pushed some updates to the REST PR to include initial work on proxy 
configuration.  I haven't updated the docs yet (until this is finalized).  It 
adds new config variables as shown below:

  "type": "http",
  "cacheResults": true,
  "connections": {},
  "timeout": 0,
  "proxyHost": null,
  "proxyPort": 0,
  "proxyType": null,
  "proxyUsername": null,
  "proxyPassword": null,
  "enabled": true
I started on getting Drill to recognize the proxy info from the environment, 
but haven't quite finished that.  The plan is for the plugin config to 
override environment vars.
Feedback is welcome.
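
The intended precedence (plugin config first, environment second) could look roughly like the sketch below. It is Python for illustration only; the function name `resolve_proxy`, the `HTTP_PROXY`/`http_proxy` variable names, and taking the proxy type from the URL scheme are all assumptions, since the thread does not say which environment variables will be read:

```python
import os
from urllib.parse import urlsplit


def resolve_proxy(plugin_config, environ=None):
    """Return proxy settings, letting the plugin config override the environment."""
    environ = os.environ if environ is None else environ

    # 1. An explicit proxyHost in the plugin config wins.
    host = plugin_config.get("proxyHost")
    if host:
        return {
            "host": host,
            "port": plugin_config.get("proxyPort", 0),
            "type": plugin_config.get("proxyType"),
        }

    # 2. Otherwise fall back to the environment (variable names are assumed).
    raw = environ.get("HTTP_PROXY") or environ.get("http_proxy")
    if not raw:
        return None  # no proxy configured anywhere

    # Accept both "host:port" and "scheme://host:port".
    parts = urlsplit(raw if "://" in raw else "http://" + raw)
    return {"host": parts.hostname, "port": parts.port or 0, "type": parts.scheme}


# The plugin config overrides the environment when both are present.
chosen = resolve_proxy(
    {"proxyHost": "plugin.example", "proxyPort": 8080, "proxyType": "http"},
    {"HTTP_PROXY": "http://env.example:3128"},
)
```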

@paul-rogers, I think you can push to my branch (or submit a PR?) and that 
will be included in the main PR.
-- C

> On Mar 31, 2020, at 10:40 PM, Rafael Jaimes III <rafjaimes@gmail.com> wrote:
> Yes, your initial assessment was correct; there is extra material other
> than the data field.
> The returned JSON has some top-level fields that don't go any deeper,
> akin to your "status" : ok field. In the example I'm running now, one
> is called MessageState which is set to "NEW". There's another field
> called MessageData, which, obviously, holds most of the data. There
> are some other top-level fields, and one is called MessageHeader which
> is nested. There's a lot of stuff here, and this is just one "table" I'm 
> querying against now.
> Not sure how it will differ with the other services.
> The service is definitely returning multiple records - I believe it's
> a JSON array and Drill+HTTP/plugin appears to handle it quite well.
> You're right, Drill is handling most of the structure by modifying my
> SELECT statement as you suggested.
> For filter pushdown, expressions of that form would be great. That's
> what I had in mind too.
> Thanks,
> Rafael
> On Tue, Mar 31, 2020 at 10:14 PM Paul Rogers
> <par0328@yahoo.com.invalid>
> wrote:
>> Hi Rafael,
>> Thanks much for the info. We had already implemented filter push-down
>> for other plugins, and for a few custom REST APIs, so it should be
>> possible to port it over to the HTTP plugin. If you can supply code,
>> then you can convert filters to anything you want, a specialized JSON 
>> request body, etc.
>> To do this generically, we have to make some assumptions, such as
>> either 1) all fields can be pushed as query parameters, or 2) only
>> those in some config list. Either way, we know how to create
>> name=value pairs in either a GET or POST format.
>> You mentioned that your "payload" objects are structured. Drill can
>> already handle this; your query can map them to the top level:
>> SELECT t.characteristic.color.name AS color_name,
>>        t.characteristic.color.confidence AS color_confidence, ...
>> FROM yourTable AS t
>> You'll get that "out of box." Drill does assume that data is in
>> "record format": a single list of objects which represent records. Code would
>> be needed to handle, say, two separate lists of objects or other,
>> more-general, JSON structures.
>> My specific question was more around the response from your web service.
>> Does that have extra material besides just the data records? Something 
>> like:
>> { "status": "ok", "data": [ {characteristic: ... }, {...}] }
>> Or, is the response directly an array of objects:
>> [ {characteristic: ... }, {...}]
>> If it is just an array, then the "out of the box" plugin will work.
>> If there is other stuff, then you'll need the new feature to tell
>> Drill how to find the path to your data. The present version needs
>> code, but I'm thinking we can just use an array of names in the plugin 
>> config:
>> dataPath: [ "data" ],
>> Or, in your case, do you get a single record per HTTP request? If a
>> single record, then either your queries will be super-simple, or
>> performance will be horrible when requesting multiple records. (The
>> HTTP plugin only does one request and assumes it will get back a set
>> of records as a JSON array or as whitespace-separated JSON objects as
>> in a JSON file.)
>> Can you clarify a bit which of these cases your data follows?
>> I like your idea of optionally supplying a parser class for the "hard"
>> cases:
>> messageParserClass: "com.mycompany.drill.MyMessageParser",
>> As long as the class is on the classpath, Java will find it.
>> Finally, on the filter push-down, the existing code we're thinking of
>> using can handle expressions of the form:
>> column op constant
>> Where "op" is one of the relational operators: =, !=, < etc. Also
>> handles the obvious variations (constant op column, column BETWEEN
>> const1 AND const2, column IN (const1, const2, ...)).
>> The code cannot handle expressions (due to a limitation in Drill itself).
>> That is, this won't work as a filter push-down: col = 10 + 2 or col +
>> 2 = 10. Nor can it handle multi-column expressions: column1 = column2, etc.
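
To make the "column op constant" idea concrete, here is a rough Python sketch of turning simple filters into name=value query parameters and leaving everything else for Drill to evaluate. The function name `filters_to_query`, the tuple representation, and the `pushable` set are hypothetical; encoding BETWEEN as "low,high" follows Rafael's tentative remark about his API and is not a general rule:

```python
from urllib.parse import urlencode


def filters_to_query(filters, pushable):
    """Split (column, op, constant) filters into a query string and a residual list.

    Only columns named in `pushable` are sent to the service; any filter that
    cannot be expressed as name=value stays behind for the engine to apply.
    """
    params, residual = [], []
    for column, op, value in filters:
        if column not in pushable:
            residual.append((column, op, value))
        elif op == "=":
            params.append((column, str(value)))
        elif op == "BETWEEN":
            low, high = value
            # Assumed encoding: one comma-separated range value.
            params.append((column, f"{low},{high}"))
        else:
            # Operators with no agreed URL form are not pushed down.
            residual.append((column, op, value))
    return urlencode(params), residual


filters = [
    ("color", "=", "red"),
    ("size", "BETWEEN", (100, 500)),
    ("weight", "<", 10),
]
query, residual = filters_to_query(filters, {"color", "size", "weight"})
```

The residual list is the part that would still run inside Drill after the HTTP request returns.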
>> I'll write up something more specific so you can see exactly what we
>> propose.
>> Thanks,
>> - Paul
>>    On Tuesday, March 31, 2020, 6:39:57 PM PDT, Rafael Jaimes III <
>>rafjaimes@gmail.com> wrote:
>> Either a text description of the parse path or specifying the class
>> with the message parser could work.
>> I think the latter would be better, if it were as simple as dropping the
>> JAR in 3rdparty after Drill is already built.
>> That way we can just continually add parsers ad-hoc.
>> An example JSON response includes about 4 top-level fields, then 2 of
>> those fields have many sub-fields.
>> For example a field could be nested 3 levels deep and say:
>> Characteristic:
>>  Color:
>>      Color name: "Red"
>>      Confidence: 100
>>  Physical:
>>      Size: 405
>>      Confidence:  95
>> As you can imagine, it would be difficult to flatten this because of
>> repeated sub-field names like "Confidence".
>> I don't think it would be easily exportable into a CSV.
>> At least for me, a pandas DataFrame is the ultimate destination for all
>> of this, and it doesn't handle nested fields well either.
>> I'll have to handle some parsing on my end.
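
One common way to handle that client-side parsing is to flatten each nested record into dotted column names, so repeated sub-field names such as "Confidence" stay distinct. A minimal standard-library sketch follows; the `flatten` helper and the dotted naming are illustrative assumptions (`pandas.json_normalize` offers similar behavior for those already using pandas):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the parent path so sibling fields stay unique.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# The example structure from the message above.
record = {
    "Characteristic": {
        "Color": {"Color name": "Red", "Confidence": 100},
        "Physical": {"Size": 405, "Confidence": 95},
    }
}
row = flatten(record)
```

Here the two "Confidence" values end up under "Characteristic.Color.Confidence" and "Characteristic.Physical.Confidence", so they no longer collide in a flat table.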
>> Filter pushdown would be huge and much desired.
>> Our other end-users are accustomed to using SQL in that manner and
>> the REST API we use fully supports AND, OR, BETWEEN, =, <, >, etc. (I
>> can get a full list if you're interested).
>> For example I think "between" is a ",". Converting the SQL statement
>> into the URL format would be awesome and help streamline querying
>> across data sources.
>> This is one of the main reasons why we're so interested in Drill.
>> Thanks,
>> Rafael
