drill-user mailing list archives

From Avner Levy <avner.l...@gmail.com>
Subject Re: exec.queue.enable in drill-embedded
Date Mon, 29 Jun 2020 17:13:09 GMT
Hi Paul & Rafael, I really appreciate your assistance.
My Parquet files are really small (less than 1 MB in most cases) and the
returned JSON is usually less than a few MB.
Moving to a 28GB heap helped (although even with 28GB I get heap issues once
in a while).
Since my queries are on a small amount of data (usually a join between two
or three 1MB parquet files), I was thinking of deploying a bunch of
standalone Drill containers with some auto-scaling policy and ELB in front.
This reduces the complexity of managing a cluster and dealing with
downtimes (if there are problems, you just restart the container).
This enables providing a SQL engine over S3 parquet files to other
microservices over REST (limited to small-scale queries, which suits my
requirements).
But my Drill knowledge is very limited, so any feedback is appreciated.
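
For reference, here is a minimal sketch of how one of the microservices might call a standalone Drill container over REST. It assumes Drill's web port is the default 8047 on localhost and that the query goes to the /query.json endpoint; the function names and the S3 path in the usage note are hypothetical. As Paul notes below, the whole result set comes back as one JSON document, so this only suits small results.

```python
import json
import urllib.request

# Assumption: a Drill container reachable on the default web port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_request(sql, url=DRILL_URL):
    """Build the POST request that submits a SQL query to Drill's REST API."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_rows(payload):
    """Extract the row list from the single JSON document Drill returns.

    The whole result is decoded in memory at once, so this is only
    appropriate for small result sets.
    """
    doc = json.loads(payload)
    return doc.get("rows", [])


def run_query(sql, url=DRILL_URL, timeout=30):
    """Send the query and return the rows as a list of dicts."""
    with urllib.request.urlopen(build_request(sql, url), timeout=timeout) as resp:
        return parse_rows(resp.read())
```

Usage would look something like run_query("SELECT * FROM s3.`path/to/file.parquet` LIMIT 10"), where the storage plugin name and path are placeholders for whatever is configured in the container.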
Thanks,
Avner

On Sun, Jun 28, 2020 at 8:53 PM Paul Rogers <par0328@gmail.com> wrote:

> Hi Avner,
>
> Query queueing is not available in embedded mode: it uses ZK to throttle
> the number of concurrent queries across a cluster; but embedded does not
> have a cluster or use ZK. (If you are running more than a few concurrent
> queries, embedded mode is likely the wrong deployment model anyway.)
>
> The problem here is the use of the REST API. It has horrible performance;
> it buffers the entire result set in memory in a way that overwhelms the
> heap. The REST API was designed to power the Web UI for small queries of
> fewer than a few hundred rows. Drill was designed assuming "real" queries
> would use the
> ODBC, JDBC or native APIs.
>
> That said, there is an in-flight PR designed to fix the heap memory issue
> for REST queries. However, even with that fix, your client must still be
> capable of handling a very large JSON document since rows are not returned
> in a "jsonlines" format or in batches. If you retrieve a million rows, they
> will be in a single huge JSON document.
>
> How many rows does the query return? If a few thousand or less, we can
> perhaps finish up the REST fix to solve the issue. Else, consider switching
> to a more scalable API.
>
> How many rows are read from S3? Doing what kind of processing? Simple WHERE
> clause, or is there some ORDER BY, GROUP BY or joins that would cause
> memory use? If just a scan and WHERE clause, then the memory you are using
> should be plenty - once the REST problem is fixed.
>
> Thanks,
>
> - Paul
>
>
> On Sun, Jun 28, 2020 at 3:17 PM Avner Levy <avner.levy@gmail.com> wrote:
>
> > Hi,
> > I'm using Drill 1.18 (master) docker and trying to configure its memory
> > after getting out of heap memory errors:
> > "RESOURCE ERROR: There is not enough heap memory to run this query using
> > the web interface."
> > The docker is serving remote clients through the REST API.
> > The queries are simple selects over tiny parquet files that are stored in
> > S3.
> > It is running in a 16GB container, configured with a heap of 8GB and
> > 8GB of direct memory.
> > I tried to use:
> >   exec.queue.enable=true
> >   exec.queue.large=1
> >   exec.queue.small=1
> >
> > and verified it was configured correctly, but I still see queries running
> > concurrently.
> > In addition, the "drill.queries.enqueued" counter remains zero.
> > Is this mechanism supported in drill-embedded?
> >
> > In addition, it seems there is some memory leak, since after a while even
> > with no query running for a while, running a single tiny query still
> > gives the same error.
> > Any insight would be highly appreciated :)
> > Thanks,
> >   Avner
> >
>
