drill-user mailing list archives

From Rafael Jaimes III <rafjai...@gmail.com>
Subject Re: exec.queue.enable in drill-embedded
Date Tue, 30 Jun 2020 00:17:27 GMT
Hi Avner,

By cluster, that's exactly what I was thinking: multiple deployed
containers. But if you do this, I don't recommend the embedded standalone
drillbit. You can run Drill in distributed mode instead and use ZooKeeper
to manage the drillbits. It's pretty straightforward, except that last
time I checked it doesn't really work out of the box when scaling up pods
from the same container. There has been some progress recently for
Docker/Kubernetes, though - if you search the mailing list you'll probably
find it (see also the git link below). It's possible the improvements will
be out for 1.18. Maybe there's an easier way to do all this with Amazon,
but I don't have experience with that.
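
To give an idea, in distributed mode each drillbit just needs to point at
the same ZooKeeper quorum in conf/drill-override.conf - a rough sketch,
where the cluster-id and ZK hostnames are placeholders:

  drill.exec: {
    cluster-id: "drillbits1",
    zk.connect: "zk-1:2181,zk-2:2181,zk-3:2181"
  }

and then you start each node with bin/drillbit.sh start instead of
drill-embedded.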

Agirish has done some great work with drill and k8s support:
https://github.com/Agirish

Rafael

On Mon, Jun 29, 2020 at 1:13 PM Avner Levy <avner.levy@gmail.com> wrote:
>
> Hi Paul & Rafael, I really appreciate your assistance.
> My Parquet files are really small (less than 1 MB in most cases) and the
> returned JSON is usually less than a few MB.
> Moving to a 28GB heap helped (although even with 28GB I get heap issues
> once in a while).
> Since my queries are on a small amount of data (usually a join between two
> or three 1MB parquet files), I was thinking of deploying a bunch of
> standalone Drill containers with some auto-scaling policy and ELB in front.
> This reduces the complexity of managing a cluster and dealing with
> downtime (if there are problems, you just restart the container).
> This makes it possible to provide a SQL engine over S3 Parquet files to other
> microservices over REST (limited to small-scale queries, which suits my
> requirements).
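>
> To give a sense of the shape, the queries are roughly simple joins over
> those files through the S3 storage plugin, along these lines (the plugin
> name and paths below are only placeholders for illustration):
>
>   SELECT a.id, b.label
>   FROM s3.`facts.parquet` a
>   JOIN s3.`lookup.parquet` b ON a.code = b.code;
>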
> But my Drill knowledge is very limited, so any feedback is appreciated.
> Thanks,
> Avner
>
> On Sun, Jun 28, 2020 at 8:53 PM Paul Rogers <par0328@gmail.com> wrote:
>
> > Hi Avner,
> >
> > Query queueing is not available in embedded mode: it uses ZK to throttle
> > the number of concurrent queries across a cluster; but embedded does not
> > have a cluster or use ZK. (If you are running more than a few concurrent
> > queries, embedded mode is likely the wrong deployment model anyway.)
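> >
> > (For completeness: on a distributed cluster you would enable the queue with
> > something like the following - the limits here are only illustrative values:
> >
> >   ALTER SYSTEM SET `exec.queue.enable` = true;
> >   ALTER SYSTEM SET `exec.queue.small` = 2;
> >   ALTER SYSTEM SET `exec.queue.large` = 1;
> >
> > but in embedded mode there is no ZK-backed queue to enforce those limits.)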
> >
> > The problem here is the use of the REST API. It has horrible performance;
> > it buffers the entire result set in memory in a way that overwhelms the
> > heap. The REST API was designed to power the Web UI for small queries of a
> > few hundred rows or so. Drill was designed assuming "real" queries would use
> > the ODBC, JDBC or native APIs.
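> >
> > If it helps, here is a minimal JDBC sketch, assuming the drill-jdbc-all
> > driver is on the classpath and a drillbit is reachable on localhost (the
> > host and query are placeholders):
> >
> >   import java.sql.Connection;
> >   import java.sql.DriverManager;
> >   import java.sql.ResultSet;
> >   import java.sql.Statement;
> >
> >   public class DrillJdbcExample {
> >     public static void main(String[] args) throws Exception {
> >       // Direct connection to one drillbit; use jdbc:drill:zk=<zk-hosts> for a cluster.
> >       try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
> >            Statement stmt = conn.createStatement();
> >            ResultSet rs = stmt.executeQuery("SELECT version FROM sys.version")) {
> >         while (rs.next()) {
> >           System.out.println(rs.getString("version"));
> >         }
> >       }
> >     }
> >   }
> >
> > The JDBC path streams record batches to the client, so the result set does
> > not have to be buffered on the server heap the way the REST response is.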
> >
> > That said, there is an in-flight PR designed to fix the heap memory issue
> > for REST queries. However, even with that fix, your client must still be
> > capable of handling a very large JSON document since rows are not returned
> > in a "jsonlines" format or in batches. If you retrieve a million rows, they
> > will be in a single huge JSON document.
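> >
> > Concretely, a REST query is one POST and one JSON response - roughly like
> > this, assuming the default web port 8047 (host and query are placeholders):
> >
> >   curl -s -X POST http://localhost:8047/query.json \
> >     -H "Content-Type: application/json" \
> >     -d '{"queryType": "SQL", "query": "SELECT * FROM sys.version"}'
> >
> > The response is a single document containing the column list and all of the
> > rows, so a million-row result becomes one enormous JSON object.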
> >
> > How many rows does the query return? If a few thousand or less, we can
> > perhaps finish up the REST fix to solve the issue. Else, consider switching
> > to a more scalable API.
> >
> > How many rows are read from S3? Doing what kind of processing? Simple WHERE
> > clause, or is there some ORDER BY, GROUP BY or joins that would cause
> > memory use? If just a scan and WHERE clause, then the memory you are using
> > should be plenty - once the REST problem is fixed.
> >
> > Thanks,
> >
> > - Paul
> >
> >
> > On Sun, Jun 28, 2020 at 3:17 PM Avner Levy <avner.levy@gmail.com> wrote:
> >
> > > Hi,
> > > I'm using the Drill 1.18 (master) Docker image and trying to configure its
> > > memory after getting out-of-heap-memory errors:
> > > "RESOURCE ERROR: There is not enough heap memory to run this query using
> > > the web interface."
> > > The container is serving remote clients through the REST API.
> > > The queries are simple selects over tiny parquet files that are stored in
> > > S3.
> > > It is running in a 16GB container, configured with an 8GB heap and 8GB of
> > > direct memory.
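> > > For reference, that is set in conf/drill-env.sh, roughly like this (the
> > > values are simply the ones I am currently using):
> > >
> > >   export DRILL_HEAP="8G"
> > >   export DRILL_MAX_DIRECT_MEMORY="8G"
> > >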
> > > I tried to use:
> > >   exec.queue.enable=true
> > >   exec.queue.large=1
> > >   exec.queue.small=1
> > >
> > > and verified the options were set correctly, but I still see queries running
> > > concurrently.
> > > In addition, the "drill.queries.enqueued" counter remains zero.
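> > > (I checked the values through sys.options, with a query along these lines -
> > > shown only for illustration:
> > >
> > >   SELECT * FROM sys.options WHERE name LIKE 'exec.queue%';
> > >
> > > and the queue options do show up as set there.)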
> > > Is this mechanism supported in drill-embedded?
> > >
> > > In addition, there seems to be some memory leak: after a while, even when no
> > > query has run for some time, running a single tiny query still gives the same
> > > error.
> > > Any insight would be highly appreciated :)
> > > Thanks,
> > >   Avner
> > >
> >
