drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Charles Givre <cgi...@gmail.com>
Subject Re: REST data source?
Date Thu, 02 Apr 2020 17:31:39 GMT
Rafael, 
I'd like to echo what Paul said.  It's super-helpful to know how people are actually using
this and what features are needed.  Thanks again!
-- C



> On Apr 2, 2020, at 1:18 PM, Paul Rogers <par0328@yahoo.com.INVALID> wrote:
> 
> Hi Rafael,
> 
> Thanks for the update! We were thinking to try to finish up the current PR so we can
get it merged into Drill. Then, we can add a simpler way to handle the extra message fields,
and add the filter push-down code. We look forward to you continued advice as we add those
additional features.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Thursday, April 2, 2020, 7:40:08 AM PDT, Jaimes, Rafael - 0993 - MITLL <rafael.jaimes@ll.mit.edu>
wrote:  
> 
> Hi all,
> 
> Just an update after testing HTTP REST plugin some more. It's working well.
> I'm not sure how common or standardized these operators are, but in case it
> is useful to you, I've been using the following:
> = =~  %3C %3E .. , 
> equals, not equals, less than, greater than, between, in
> 
> Let me know if you have any questions or if additional testing would help.
> 
> Thanks,
> Rafael
> 
> -----Original Message-----
> From: Jaimes, Rafael - 0993 - MITLL <Rafael.Jaimes@ll.mit.edu> 
> Sent: Wednesday, April 1, 2020 12:43 PM
> To: user@drill.apache.org
> Subject: RE: REST data source?
> 
> Yes that's correct. I saw the work you started with the env vars, but for
> now I set the proxy in the plugin.
> 
> - Rafael
> 
> -----Original Message-----
> From: Charles Givre <cgivre@gmail.com> 
> Sent: Wednesday, April 1, 2020 12:42 PM
> To: user@drill.apache.org
> Subject: Re: REST data source?
> 
> Hey Rafael, 
> Thanks for the feedback.  My original idea was to pull the proxy from the
> environment vars in HTTP_PROXY/HtTTPS_PROXY and ALL_PROXY but that part
> isn't quite done yet. Did you set the proxy info via the plugin config?
> -- C
> 
> 
>> On Apr 1, 2020, at 10:22 AM, Jaimes, Rafael - 0993 - MITLL
> <Rafael.Jaimes@ll.mit.edu> wrote:
>> 
>> Hi all,
>> 
>> I built Charles' latest branch including the proxy setup. It appears to be
> 
>> working quite well going through the proxy.
>> 
>> I'll continue to test and report back if I find any issues.
>> 
>> Note: Beyond Paul's repo recommendations, I had to skip checkstyle to get
> the 
>> maven build to complete. You're probably already aware of that, I think
> it's 
>> just specific to this branch.
>> 
>> Thanks!
>> Rafael
>> 
>> -----Original Message-----
>> From: Paul Rogers <par0328@yahoo.com.INVALID>
>> Sent: Wednesday, April 1, 2020 1:29 AM
>> To: user <user@drill.apache.org>
>> Subject: Re: REST data source?
>> 
>> Thanks, Charles.
>> 
>> As Charles suggested, I pushed a commit that replaces the "old" JSON
> reader 
>> with the new EVF-based one. Eventually this will allow us to use a
> "provided 
>> schema" to handle any JSON ambiguities.
>> 
>> As we've been discussing, I'll try to add the ability to specify a path to
> 
>> data: "response/payload/records" or whatever. With the present commit,
> that 
>> path can be parsed in code, but I think a simple path spec would be
> easier.
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>     On Tuesday, March 31, 2020, 10:00:52 PM PDT, Charles Givre 
>> <cgivre@gmail.com> wrote:
>> 
>> Hello all,
>> I pushed some updates to the REST PR to include initial work on proxy 
>> configuration.  I haven't updated the docs yet (until this is finalized).
> It 
>> adds new config variables as shown below:
>> 
>> {
>>   "type": "http",
>>   "cacheResults": true,
>>   "connections": {},
>>   "timeout": 0,
>>   "proxyHost": null,
>>   "proxyPort": 0,
>>   "proxyType": null,
>>   "proxyUsername": null,
>>   "proxyPassword": null,
>>   "enabled": true
>> }
>> I started on getting Drill to recognize the proxy info from the
> environment, 
>> but haven't quite finished that.  The plan is for the plugin config to 
>> override environment vars.
>> Feedback is welcome.
>> 
>> @paul-rogers, I think you can push to my branch (or submit a PR?) and that
> 
>> will be included in the main PR.
>> -- C
>> 
>> 
>> 
>>> On Mar 31, 2020, at 10:40 PM, Rafael Jaimes III <rafjaimes@gmail.com>
> wrote:
>>> 
>>> Yes your initial assessment was correct, there is extra material other
>>> than the data field.
>>> The returned JSON has some top-level fields that don't go any deeper,
>>> akin to your "status" : ok field. In the example I'm running now, one
>>> is called MessageState which is set to "NEW". There's another field
>>> called MessageData, which, obviously, holds most of the data. There
>>> are some other top-level fields, and one is called MessageHeader which
>>> is nested. There's a lot of stuff here, and this is just one "table" I'm 
>>> querying against now.
>>> Not sure how it will differ with the other services.
>>> 
>>> The service is definitely returning multiple records - I believe it's
>>> a JSON array and Drill+HTTP/plugin appears to handle it quite well.
>>> 
>>> You're right, Drill is handling most of the structure by modifying my
>>> SELECT statement as you suggested.
>>> 
>>> For filter pushdown, expressions of that form would be great. That's
>>> what I had in mind too.
>>> 
>>> Thanks,
>>> Rafael
>>> 
>>> On Tue, Mar 31, 2020 at 10:14 PM Paul Rogers
>>> <par0328@yahoo.com.invalid>
>>> wrote:
>>> 
>>>> Hi Rafael,
>>>> 
>>>> Thanks much for the info. We had already implemented filter push-down
>>>> for other plugins, and for a few custom REST APIs, so should be
>>>> possible to port it over to the HTTP plugin. If you can supply code,
>>>> then you can convert filters to anything you want, a specialized JSON 
>>>> request body, etc.
>>>> To do this generically, we have to make some assumptions, such as
>>>> either 1) all fields can be pushed as query parameters, or 2) only
>>>> those in some config list. Either way, we know how to create
>>>> name=value pairs in either a GET or POST format.
>>>> 
>>>> You mentioned that your "payload" objects are structured. Drill can
>>>> already handle this; your query can map them to the top level:
>>>> 
>>>> SELECT t.characteristic.color.name AS color_name,
>>>> t.characteristic.color.confidence AS color_confidence, ...  FROM
>>>> yourTable AS t
>>>> 
>>>> You'll get that "out of box." Drill does assume that data is in
>>>> "record
>>>> format": a single list of objects which represent records. Code would
>>>> be needed to handle, say, two separate lists of objects or other,
>>>> more-general, JSON structures.
>>>> 
>>>> 
>>>> My specific question was more around the response from your web service.
>>>> Does that have extra material besides just the data records? Something 
>>>> like:
>>>> 
>>>> 
>>>> { "status": "ok", "data": [ {characteristic: ... }, {...}] }
>>>> 
>>>> Or, is the response directly an array of objects:
>>>> 
>>>> [ {characteristic: ... }, {...}]
>>>> 
>>>> 
>>>> If it is just an array, then the "out of the box" plugin will work.
>>>> If there is other stuff, then you'll need the new feature to tell
>>>> Drill how to find the field to your data. The present version needs
>>>> code, but I'm thinking we can just use an array of names in the plugin 
>>>> config:
>>>> 
>>>> dataPath: [ "data" ],
>>>> 
>>>> Or, in your case, do you get a single record per HTTP request? If a
>>>> single record, then either your queries will be super-simple, or
>>>> performance will be horrible when requesting multiple records. (The
>>>> HTTP plugin only does one request and assumes it will get back a set
>>>> of records as a JSON array or as whitespace-separated JSON objects as
>>>> in a JSON file.)
>>>> 
>>>> Can you clarify a bit which of these cases your data follows?
>>>> 
>>>> I like your idea of optionally supplying a parser class for the "hard"
>>>> cases:
>>>> 
>>>> messageParserClass: "com.mycompany.drill.MyMessageParser",
>>>> 
>>>> As long as the class is on the classpath, Java will find it.
>>>> 
>>>> Finally, on the filter push-down, the existing code we're thinking of
>>>> using can handle expressions of the form:
>>>> 
>>>> column op constant
>>>> 
>>>> Where "op" is one of the relational operators: =, !=, < etc. Also
>>>> handles the obvious variations (const op constant, column BETWEEN
>>>> const1 AND const2, column IN (const1, const2, ...)).
>>>> 
>>>> The code cannot handle expressions (due to a limitation in Drill
> itself).
>>>> That is, this won't work as a filter push-down: col = 10 + 2 or col +
>>>> 2 = 10. Nor can it handle multi-column expressions: column1 = column2,
> etc.
>>>> 
>>>> 
>>>> I'll write up something more specific so you can see exactly what we
>>>> propose.
>>>> 
>>>> 
>>>> Thanks,
>>>> - Paul
>>>> 
>>>> 
>>>> 
>>>>   On Tuesday, March 31, 2020, 6:39:57 PM PDT, Rafael Jaimes III <
>>>> rafjaimes@gmail.com> wrote:
>>>> 
>>>> Either a text description of the parse path or specifying the class
>>>> with the message parser could work.
>>>> I think the latter would be better, if it were simple as dropping the
>>>> JAR in 3rdparty after Drill is already built.
>>>> That way we can just continually add parsers ad-hoc.
>>>> 
>>>> An example JSON response includes about 4 top-level fields, then 2 of
>>>> those fields have many sub-fields.
>>>> For example a field could be nested 3 levels deep and say:
>>>> 
>>>> Characteristic:
>>>> 
>>>> Color:
>>>> 
>>>>     Color name: "Red"
>>>> 
>>>>     Confidence: 100
>>>> 
>>>> Physical:
>>>> 
>>>>     Size: 405
>>>> 
>>>>     Confidence:  95
>>>> 
>>>> As you can imagine, it would be difficult to flatten this because of
>>>> repeated sub-field names like "Confidence".
>>>> 
>>>> I don't think it would be easily exportable into a CSV.
>>>> At least for me pandas dataframe is the ultimate destination for all
>>>> of this, which also don't handle nested fields well either.
>>>> I'll have to handle some parsing on my end.
>>>> 
>>>> Filter pushdown would be huge and much desired.
>>>> Our other end-users are accustomed to using SQL in that manner and
>>>> the REST API we use fully support AND, OR, BETWEEN, =, <, >, etc (I
>>>> can get a full list if you're interested).
>>>> For example I think "between" is a ",". Converting the SQL statement
>>>> into the URL format would be awesome and help streamline querying
>>>> across data sources.
>>>> This is one of the main reasons why we're so interested in Drill.
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Rafael
>>>> 
>> 
> 


Mime
View raw message