calcite-dev mailing list archives

From Andrei Sereda <and...@sereda.cc>
Subject Re: calcite overhead for simple queries (optimization and conversion phase)
Date Wed, 01 May 2019 15:57:27 GMT
In case somebody is interested, here is an isolated test to compare performance:
https://gist.github.com/asereda-gs/9f09f3c46904c4b6670aeca602164c6b
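
Roughly, the comparison works like the sketch below (this is not the exact
gist code; the HSQLDB URL, table T, column names and the percentile helper
are placeholders, and creating/populating the table is omitted):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;
import java.util.Properties;
import javax.sql.DataSource;
import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class CalciteOverheadBench {
  // The same aggregate query is timed directly against HSQLDB and through Calcite.
  private static final String SQL =
      "select col1, col2, sum(col3) from T where id in (1, 2, 3) group by col1, col2";

  public static void main(String[] args) throws Exception {
    // Baseline: plain HSQLDB connection (table T is assumed to exist already).
    Connection direct = DriverManager.getConnection("jdbc:hsqldb:mem:bench", "SA", "");

    // Calcite connection exposing the same HSQLDB database via the JDBC adapter,
    // so every query goes through parsing, sql2rel conversion and the planner.
    Connection calcite = DriverManager.getConnection("jdbc:calcite:", new Properties());
    CalciteConnection cc = calcite.unwrap(CalciteConnection.class);
    SchemaPlus root = cc.getRootSchema();
    DataSource ds = JdbcSchema.dataSource(
        "jdbc:hsqldb:mem:bench", "org.hsqldb.jdbc.JDBCDriver", "SA", "");
    root.add("HS", JdbcSchema.create(root, "HS", ds, null, "PUBLIC"));
    cc.setSchema("HS");

    System.out.println("direct  " + percentiles(run(direct, 200)));
    System.out.println("calcite " + percentiles(run(calcite, 200)));
  }

  /** Runs the query n times and returns the elapsed time of each run in ms. */
  private static long[] run(Connection conn, int n) throws Exception {
    long[] millis = new long[n];
    for (int i = 0; i < n; i++) {
      long start = System.nanoTime();
      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery(SQL)) {
        while (rs.next()) {
          // drain the result set so both paths do the same amount of work
        }
      }
      millis[i] = (System.nanoTime() - start) / 1_000_000;
    }
    return millis;
  }

  private static String percentiles(long[] millis) {
    long[] sorted = millis.clone();
    Arrays.sort(sorted);
    return String.format("50%%:%d 95%%:%d max:%d",
        sorted[sorted.length / 2],
        sorted[(int) (sorted.length * 0.95)],
        sorted[sorted.length - 1]);
  }
}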

I get the following results (a 10x difference). My previous comparison was
with Elasticsearch; this one uses HSQLDB:

=== Column columnCount: 10
direct count:200 50%:0 80%:0 90%:0 95%:0 max:1
calcite count:200 50%:25 80%:28 90%:30 95%:32 max:40

=== Column columnCount: 100
direct count:200 50%:3 80%:3 90%:3 95%:3 max:5
calcite count:200 50%:48 80%:49 90%:50 95%:51 max:60

=== Column columnCount: 500
direct count:200 50%:19 80%:20 90%:21 95%:22 max:29
calcite count:200 50%:190 80%:195 90%:200 95%:204 max:241

=== Column columnCount: 1000
direct count:200 50%:43 80%:47 90%:50 95%:51 max:56
calcite count:200 50%:331 80%:342 90%:352 95%:361 max:389

It's possible that my test is doing something wrong or that I'm measuring a
different operation (an independent "audit" would be appreciated).

I will check if I can narrow down the issue with a profiler.
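
For reference, the PreparedStatement re-use work-around discussed in the
thread below would look roughly like the sketch here (the cache class and
names are mine, not Calcite/Avatica API):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Caches prepared statements per SQL string so parsing, sql2rel conversion
 *  and planning are paid only once per distinct query. */
class PreparedStatementCache {
  private final Connection connection;   // a Calcite JDBC connection
  private final Map<String, PreparedStatement> cache = new ConcurrentHashMap<>();

  PreparedStatementCache(Connection connection) {
    this.connection = connection;
  }

  /** The first call for a given SQL string prepares (and plans) it; later calls re-use it. */
  PreparedStatement prepare(String sql) {
    return cache.computeIfAbsent(sql, s -> {
      try {
        return connection.prepareStatement(s);
      } catch (SQLException e) {
        throw new RuntimeException(e);
      }
    });
  }
}

As Julian says below, every execution then runs the same query and the same
plan; also, a JDBC PreparedStatement is generally not safe to execute from
several threads at once, so a real cache would need per-thread statements or
some synchronization (the "executing same query in parallel" complexity
mentioned in the original message).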

On Tue, Apr 30, 2019 at 4:07 PM Julian Hyde <jhyde@apache.org> wrote:

> > As a temporary work-around, can I re-use a PreparedStatement?
>
> If you mean re-execute PreparedStatement, yes. Its main purpose is to only
> pay the cost of preparation once, but execute multiple times (possibly with
> different parameter values).
>
> But each time you execute it will be the same query and the same plan.
>
> Julian
>
>
> > On Apr 30, 2019, at 11:43 AM, Andrei Sereda <andrei@sereda.cc> wrote:
> >
> >> Consider using Hep planner rather than Volcano planner.
> > Will check with Hep. Working on an isolated unit test.
> >
> >> If you reduce the number of columns (to say 10), does the time reduce
> >> significantly? That might be a clue that there is a performance bug
> >> somewhere.
> > The number of columns seems to be correlated with the performance penalty:
> > - for 10 columns:  120ms (raw) vs 150ms (calcite)
> > - for 50 columns:  260ms (raw) vs 740ms (calcite)
> >
> >> Are these numbers on the first query, or after the system has warmed up?
> > After. Following warm-up, I'm running 100 queries sequentially.
> >
> > As a temporary work-around, can I re-use a PreparedStatement?
> >
> > On Tue, Apr 30, 2019 at 2:06 PM Julian Hyde <jhyde@apache.org> wrote:
> >
> >> Consider using Hep planner rather than Volcano planner. (Measure the
> >> number of rule firings. Is it higher than you think is necessary, given
> >> the complexity of the query?)
> >>
> >> If you reduce the number of columns (to say 10), does the time reduce
> >> significantly? That might be a clue that there is a performance bug
> >> somewhere.
> >>
> >> Are these numbers on the first query, or after the system has warmed up?
> >>
> >> Julian
> >>
> >>
> >>> On Apr 30, 2019, at 9:41 AM, Andrei Sereda <andrei@sereda.cc> wrote:
> >>>
> >>> Hello,
> >>>
> >>> One of our applications uses Calcite as a translation layer between SQL
> >>> and the destination data source (mongo, elastic, etc.). The queries are
> >>> fairly simple and similar to the one below:
> >>>
> >>> select col1, col2, agg3(col3), agg4(col4), ..., aggN(colN) from table
> >>> where id in (1, 2, 3) group by col1, col2
> >>>
> >>> The only complexity is that the number of columns can be fairly large
> >>> (up to 150), but otherwise it is a standard aggregation with some simple
> >>> predicates (no joins). The number of rows is small, usually less than 1k.
> >>>
> >>> We have observed that the overhead for such queries is 2x-3x (95th
> >>> percentile) compared to executing the produced queries directly on the
> >>> data source (e.g. a mongo / elastic query). The difference is on the
> >>> order of a few hundred ms: 200ms (direct) vs 600ms (calcite).
> >>> Unfortunately such latency is noticeable in the UI.
> >>>
> >>> Originally I thought it had to do with compilation time (janino), but
> >>> profiling showed that most of the time is spent in the following methods:
> >>>
> >>>  1. .prepare.Prepare.optimize() (VolcanoPlanner)
> >>>  2. .sql2rel.SqlToRelConverter.convertQuery()
> >>>
> >>> What can be done to avoid such overhead?
> >>>
> >>>  1. Have the Avatica / Calcite connection cache connection.prepareStatement()
> >>>  so the same optimization is not done twice? Manually re-running the same
> >>>  PreparedStatement helps.
> >>>  2. Use the interpreter?
> >>>  3. Manually re-use the PreparedStatement (e.g. from a Cache<String,
> >>>  PreparedStatement>)? This introduces other complexities, like executing
> >>>  the same query in parallel.
> >>>  4. Minimize the number of rules?
> >>>  5. Cache the logical plan (RelNode)?
> >>>  6. Anything else?
> >>>
> >>> Many thanks in advance.
> >>>
> >>> Andrei.
> >>
> >>
>
>
