calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <>
Subject Re: Optimization of join between a small table and a large table
Date Thu, 08 Aug 2019 20:26:51 GMT
When you implement an equi-join as a nested loops join, the right-hand side always has a filter
combining the variable set by the left-hand side and the join column on the right-hand side.

You will need to do a full scan of the right-hand table every time, unless your table is organized
so that you can access a subset that matches the filter.

In your example

   from emp join dept on emp.deptid =
       where = ‘Sales'

The simple plan would be

 Correlate (v := deptno)
   Filter name = ’Sales'
     Scan dept
   Filter deptno = $v
     Scan emp

If “emp” was physically ordered by deptno, the plan could be optimized to

 Correlate (v := deptno)
   Filter name = ’Sales'
     Scan dept
   RangeScan emp deptno = $v

> On Aug 7, 2019, at 11:41 PM, Gabriel Reid <> wrote:
> Hi Stamatis,
> Thank you so much, this is exactly the kind of info I was looking for! I
> think I can figure out how to go forward based on this, thanks again.
> - Gabriel
> On Thu, Aug 8, 2019 at 8:19 AM Stamatis Zampetakis <>
> wrote:
>> Hi Gabriel,
>> What you want indeed seems to be a nested loop join; there has been a
>> relevant discussion in the dev list [1].
>> You may also find relevant the ongoing discussion on CALCITE-2979 [2].
>> Best,
>> Stamatis
>> [1]
>> [2]
>> On Wed, Aug 7, 2019 at 2:56 PM Gabriel Reid <>
>> wrote:
>>> Hi,
>>> I'm currently working on a custom Calcite adapter, and I've got a
>> situation
>>> where I want to join a small table with a large table, basically just
>> using
>>> the small table to filter the large table. Conceptually, I'm talking
>> about
>>> something that is equivalent to the following query:
>>>    select
>>>    from emp join dept on emp.deptid =
>>>        where = 'Sales'
>>> I've got converter rules which to push down filtering on all tables,
>> which
>>> make a big difference in performance. The above query current results in
>> an
>>> EnumerableHashJoin over a filtered scan over the 'dept' table, and an
>>> unfiltered scan over the 'emp' table.
>>> What I would like to accomplish is that this is converted into a (I
>> think)
>>> a nested loop join between 'dept' and emp, so that the filtered scan is
>>> done once over 'dept', and then a filtered scan is done for each entry of
>>> the 'dept' table using the 'id' value from that entry as a filter value
>> on
>>> the scan on the 'emp' table.
>>> I've been able to get a nested loop join to be used, but I haven't
>> managed
>>> to have the 'id' values from the 'dept' table to be used to filter the
>>> 'emp' table. Instead, the full 'emp' table is scanned for each iteration
>> of
>>> the 'dept' table.
>>> And now my questions: are my hopes/expectations here realistic, and/or is
>>> there a much better/easier way of accomplishing the same thing? Is there
>>> prior art somewhere within the Calcite code base or elsewhere? Or should
>>> this just be working by default?
>>> I would assume that this isn't that unusual of a situation, which is why
>> I
>>> was expecting that there would already be something like this somewhere
>> (or
>>> I'm doing the wrong thing), but I haven't managed to find any clear
>>> pointers in any one direction yet.
>>> Thanks in advance for any advice!
>>> - Gabriel

View raw message