drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Apache Drill rest api plugin
Date Tue, 24 Mar 2020 07:03:47 GMT
Hi Arun,

If I understand you, the Parquet file format is essentially unimportant. You have your own
in-memory structures that happen to be populated from Parquet.

You probably have some form of REST API that, at the least, includes projection and filtering.
That is, I can say which columns I want (projection) and which rows (filtering).
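As a concrete illustration (the endpoint and parameter names here are invented, not from any real API), a request carrying both projection and filtering might be built like this:

```python
from urllib.parse import urlencode

# Hypothetical REST endpoint; the "columns" and "filter" parameter
# names are made up purely for illustration.
base_url = "https://example.com/api/rows"

params = {
    "columns": "a,b,c",           # projection: which columns to return
    "filter": "a=10 AND b=fred",  # filtering: which rows to return
}

url = base_url + "?" + urlencode(params)
print(url)
```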

The API delivers data in some format. If a REST API, that format is probably JSON, though
JSON is horribly inefficient for a big-data, high-speed query engine. The data would, ideally,
be in some compact, easy-to-parse binary format.

There is no out-of-the-box storage plugin that will do everything you want. However, Drill
is designed to be extensible; it is not hard to build such a plugin. For example, I've now
built a couple that do something similar, and Charles is working on a generic HTTP version.


There are a number of resources to help. One is our book Learning Apache Drill. Another is
a set of notes on my wiki from when I built a similar plugin. [1] Charles mentioned his in-flight
REST API which gives you much of what you need except the filter push-down. [2]


There are two minor challenges. The first is just learning how to build Drill and assemble
the pieces needed. The book and Wiki can help. The other is to build the "filter push-down"
logic that translates from Drill's internal parse tree for filters to the format that your
REST API needs. Basically, you pull predicates out of the query (WHERE a=10 AND b="fred")
and pass them along using your REST request. There is a framework to help with filter push
downs in [3]. The framework converts from Drill's filter format to a compact (col op const)
format that handles most of the cases you'll need, at least when getting started.
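To sketch what that compact form might look like (this is illustrative Python, not Drill's actual Java framework; all names here are invented), the predicates in WHERE a=10 AND b='fred' reduce to a list of (column, operator, constant) triples:

```python
# Illustrative only: a minimal stand-in for the compact (col, op, const)
# form that a filter push-down framework might hand to a storage plugin.
# Drill's real framework is Java; the names below are made up.

predicates = [
    ("a", "=", 10),
    ("b", "=", "fred"),
]

def to_rest_params(preds):
    """Render (col, op, const) triples as simple col<op>const strings."""
    return ["{}{}{}".format(col, op, const) for col, op, const in preds]

print(to_rest_params(predicates))  # ['a=10', 'b=fred']
```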


The obvious way to pass the predicates is via an HTTP GET query string. However, predicates
can be complex; some people find it better to pass the predicates encoded in JSON via an HTTP
POST request.
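For example (the JSON shape below is hypothetical; a real API would define its own schema), the same predicates could be serialized as a JSON body for a POST:

```python
import json

# Hypothetical payload shape for passing predicates in a POST body.
predicates = [
    {"column": "a", "op": "=", "value": 10},
    {"column": "b", "op": "=", "value": "fred"},
]

body = json.dumps({"filters": predicates})
print(body)
```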

If your data does come back as JSON, you can use the existing JSON parser to read it. See
PR 1892 for an example. We are also working on an improved JSON reader, which should be available
in a couple of weeks (if all goes well).
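To sketch the reading side (the sample data below is invented for illustration), a JSON response of row objects parses naturally into column values:

```python
import json

# A made-up JSON response body of the kind such a REST API might return.
response_text = '[{"a": 10, "b": "fred"}, {"a": 11, "b": "wilma"}]'

rows = json.loads(response_text)
a_values = [row["a"] for row in rows]
print(a_values)  # [10, 11]
```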

Thanks,
- Paul


[1] https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin
[2] https://github.com/apache/drill/pull/1892
[3] https://github.com/apache/drill/pull/1914 

On Monday, March 23, 2020, 11:03:01 PM PDT, Arun Sathianathan <arun.ns@gmail.com> wrote:

Hi Paul,
Thanks for getting back. Let me rephrase the question for clarity. We have an architecture
where Parquet files in ECS are read into memory and held in in-memory structures. We then
have an API, exposed to users, that returns data from memory via REST. We would like to
know if we can query that REST API using Apache Drill. Authentication to the API is via OAuth2.
We were also pointed to the enhancement below, now in the pipeline: https://github.com/apache/drill/pull/1892

Regards,
Arun

Sent from my iPhone

On 24-Mar-2020, at 12:10 AM, Paul Rogers <par0328@yahoo.com> wrote:


Hi Navin,

Can you share a bit more what you are trying to do? ECS is Elastic Container Service, correct?
So, the Parquet files are ephemeral: they exist only while the container runs? Do the files
have a permanent form, such as in S3?

Parquet is a complex format. Drill exploits the Parquet structure to optimize query performance.
This means that Drill must seek to the header, footer and row groups of each file. More specifically,
Parquet cannot be read in a streaming fashion the way we can read CSV or JSON.
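To see why streaming will not work: a Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic "PAR1", so a reader must be able to seek to the end of the file first to locate the metadata. A minimal sketch of that footer arithmetic (using a fabricated byte string, not a real Parquet file):

```python
import struct

def footer_offset(file_bytes: bytes) -> int:
    """Locate the start of the Parquet footer metadata.

    The last 8 bytes of a Parquet file are a 4-byte little-endian
    footer length followed by the magic b"PAR1"; the footer itself
    sits immediately before them.
    """
    assert file_bytes[-4:] == b"PAR1", "not a Parquet file"
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    return len(file_bytes) - 8 - footer_len

# Fabricated stand-in: leading magic, fake row-group bytes, a 5-byte
# "footer", its length, and the trailing magic.
fake = b"PAR1" + b"\x00" * 10 + b"FOOTR" + struct.pack("<I", 5) + b"PAR1"
print(footer_offset(fake))
```

This is exactly the seek pattern an S3-style range-read API supports and a one-way streaming API does not.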

The best REST API for Parquet would be a clone of the Amazon S3 API. Alternatively, expose
the files using something like NFS so that the file on ECS appears like a local file to Drill.

You can even implement the HDFS client API on top of your REST API (assuming your REST API
supports the required functions), and use Drill's DFS plugin with your client.


Yet another alternative is to store Parquet in S3, so Drill can use the S3 API directly. Or,
to stream the content to Drill from a container, use JSON or CSV.

Lots of options that depend on what you're trying to do.

Thanks,

- Paul

 

On Monday, March 23, 2020, 6:03:48 AM PDT, Charles Givre <cgivre@gmail.com> wrote:

Hi Navin,
Thanks for your interest in Drill. To answer your question, there is currently a pull request
for a REST storage plugin [1]; however, as implemented it only accepts JSON responses. It
would not be difficult to extend the reader to accept Parquet files. Please take a look
and send any feedback.
-- C


[1]: https://github.com/apache/drill/pull/1892

> On Mar 23, 2020, at 8:14 AM, Navin Bhawsar <navin.bhawsar@gmail.com> wrote:
> 
> Hi
> 
> We are currently doing an experiment to use Apache Drill to query Parquet
> files. These Parquet files will be copied to ECS and exposed via a REST API.
> 
> Can you please advise if there is a storage plugin to query a REST API?
> 
> Currently we are using Apache Drill version 1.17 in distributed mode.
> 
> Please let me know if you need more details.
> 
> Thanks and Regards,
> Navin
    