calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Haisheng Yuan" <h.y...@alibaba-inc.com>
Subject Correlate Join SemiJoin transformation
Date Tue, 04 Jun 2019 00:26:57 GMT
Hi,

Since we have commited CALCITE-2969, and in the next release of 1.21, we will deprecate EquiJoin
and make enumerable hash join support non-equi join condition, the optimization door is open
to us.

Currently Calcite often generates complicated and inefficient plan for boolean context correlated
subquery (putting value context subquery aside, which is more complex during unnesting). e.g.
https://issues.apache.org/jira/browse/CALCITE-2857
https://issues.apache.org/jira/browse/CALCITE-2874

There are many more cases if you want to try and find out. Some underlying reasons are:
1. SemJoin doesn't support non-equi condition
2. Correlate comes with Aggregate in many cases

Sometimes, SubqueryRemoveRule generates Correlate with Aggregate, sometimes generates Join
with Aggregate. Then we use SemiJoinRule to match Join(Rel, Agg) pattern to see whether we
can transform it to semijoin or not. This is counter intuitive, because we already know it
will be a SemiJoin or AntiSemi when we generate Correlate for the subquery. In current way,
we may miss the chance to get back SemiJoin for it, and we almost lose the opportunity to
do IndexScan if there is index available.

In addition, I see 2 rules JoinAddRedundantSemiJoinRule and SemiJoinRemoveRule seem to provide
the chance to do index scan. But through the test cases related with the 2 rules, I don't
see any chance for a better plan with them and they don't seem to be able to generate plan
with index scan. I doubt whether they are used at all. Can someone shed some light on this
and give us some examples, if I get it wrong?

The transformation flow for boolean context sub-query in my head is:
sub-query -> Correate(Semi/AntiSemi) without Aggregate (SubqueryRemoveRule)
Correlate(Semi/AntiSemi) -> NestedLoopJoin (correlate implementation rule)
Correlate(Semi/AntiSemi) -> IndexedNestedLoopJoin (index apply rule)
Correlate(Semi/AntiSemi) -> SemiJoin/AntiSemiJoin (Semi Correlate to Semi Join rule)
SemiJoin -> InnerJoin(rel, Agg) (SemiJoin to Join rule)
InnerJoin(rel, Agg) ->  InnerJoin(Agg, rel) (Join reorder rule)
SemiJoin/InnerJoin -> HashJoin/MergeJoin/NLJ (Join implementation rule)

This is a long shot, and will involve tons of changes.
Any thoughts?

- Haisheng Yuan
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message