db-derby-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Db-derby Wiki] Update of "QueryPlanJoinOrder" by BryanPendleton
Date Sun, 05 Oct 2008 15:55:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by BryanPendleton:

The comment on the change is:
Edit the discussion from the derby-dev list into a wiki page

New page:
= Understanding Join Order displays in the Statement Execution Plan =

One of the results of running the query optimizer is the determination of the join order.
For a particular join there is a "left" table and a "right" table, and overall there is
an ordering of the various intermediate joins into an overall join order.

Looking at the output from the statement execution plan one might ask :
 * what is the overall join order?
 * and, for a particular join, which table is the left table and which is the right table?

As a general rule, when looking at a given query plan the join order is reflected by the order
in which you see the table names as you scroll from the top of the plan downward.

In the plan "T1" appears first (i.e. closer to the beginning of the plan), then "T3" appears
afterward, so that's a general indication that the join order is "T1", "T3".

Note that the query plan shows the underlying base table names, 
not correlation names or view names or other ways of aliasing base tables, so
with respect to the top-level query in question, i.e. to:
  select x1.j, x2.b from
    (select distinct i,j from t1) x1,
    (select  distinct a,b from t3) x2
  where x1.i = x2.a
  order by x1.j, x2.b
the join order is _technically_ { X1, X2 }.  But the query plan only shows base table access,
so we have to look to see what tables X1 and X2 access.  In this case X1 accesses T1 while
X2 accesses T3, so when we scan the plan and see { T1, T3 }, that effectively implies a join
order of { X1, X2 }. 

On a more general level, the "scan downward" approach to finding the join order works because
the query plan is written in terms of "left" and "right" result sets, as Knut Anders mentioned.
 If I'm joining three tables T1, T2, T3 and the join order chosen by the optimizer is {T2,
T3, T1} the final query tree will look something like:
        /  \
    JOIN_1  T1
      /  \
     T2   T3
Notice how each join node has a "left" and a "right" child.  The query plan is generated in
depth-first traversal order (starting with the root), so the query plan for the above tree
would look something like:
+++ LeftResultSet:
++++++ JoinResultSet_1:
+++++++++ LeftResultSet:  T2
+++++++++ RightResultSet: T3
+++ RightResultSet:       T1
>>From this we can see that the order in which the tables appear in the query plan (reading
top to bottom) will match the order that comes from reading the leaf nodes of the join tree
left-to-right, and that in turn reflects the join order chosen by the optimizer.

=== Extracting join order information automatically from the RUNTIMESTATISTICS output ===

Kathey proposed enhancing the RuntimeStatisticsParser to automatically determine join order
by searching
for the specified strings in order and return true if they are there in the order they are
in the array.

Army observed that when using this to write automated tests, the test author must be careful
to ensure
that the argument array passed to the new function includes *ALL* tables in the query, not
just a subset.
If the query is of the form:
  select ... from
    (select ... from t2, t1, t3 where ...) X1
    (select ... from t1, t2 where ...) X2
  where ...
Assume a test wants to verify that the tables in subquery X2 have a join order of { T2, T1
}, but doesn't really care about the join order of the subquery in X1, nor does it care about
the order of X1 w.r.t. X2.  You'd *still* have to make sure that the array passed into the
ordered search method includes the join order for X1, as well, otherwise the test might incorrectly

For example, if we only check for the join order of the "targeted" subquery X2, meaning we
pass ["T2", "T1"] into the proposed method and ignore X1 altogether, then the test would IN-correctly
PASS for the following query plan:
  +++ JoinNode_0:
  ++++++ LeftResultSet:                  <== This corresponds to X1
  +++++++++ JoinNode_1:
  ++++++++++++ LeftResultSet:
  +++++++++++++++ JoinNode_2:
  ++++++++++++++++++ LeftResultSet: T3
  ++++++++++++++++++ RightResultSet: T2
  ++++++++++++ RightResultSet: T1
  ++++++ RightResultSet:                 <== This corresponds to X2
  +++++++++ JoinNode_3:
  ++++++++++++ LeftResultSet: T1
  ++++++++++++ RightResultSet: T2

If you just search for "T1" followed by "T2", the test will pass because the join order for
X1 matches--but that's wrong because it's really X2 that we wanted to check.

If instead of {"T1", "T2"} you pass in {"T3", "T2", "T1", "T2", "T1"}--i.e. include *ALL*
tables in the query, even the ones that aren't necessarily targeted--then I think you'd get
the desired behavior. The downside to this is that the test will fail if a join order about
which we "don't care" changes (ex. the join order for X1 in this case).  But that's how things
work today with the canon-based test, as well, so even if it's not ideal, at least it wouldn't
really be any worse...

To get the ideal behavior (where the test fails if and only if the "targeted" subquery's join
order is not what is expected) with the proposed orderedSearchStrings() approach, one would
have to ensure that the table names used in the targeted subquery do not appear anywhere else
in the query.  My guess is that you would have to rewrite a good number of tests to guarantee
that, which would probably be non-trivial.

This page was compiled with information posted to the derby-dev mailing list by Manjula, Kathey,
Army and Knut.

View raw message