drill-dev mailing list archives

From "Kunal Khatua (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-6727) JPPD does not eliminate rows using the bloom filter if a HashJoin is involved
Date Sun, 02 Sep 2018 05:41:00 GMT
Kunal Khatua created DRILL-6727:
-----------------------------------

             Summary: JPPD does not eliminate rows using the bloom filter if a HashJoin is
involved
                 Key: DRILL-6727
                 URL: https://issues.apache.org/jira/browse/DRILL-6727
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Flow
    Affects Versions: 1.15.0
         Environment: When testing a simple join between 2 tables, it appears that the Bloom-filter
based predicate pushdown will work only for broadcast joins, but not for hash-based joins.

The purpose of the filter is to reduce the number of records being hashed across the
fragments. Since that reduction does not happen for hash-based joins, the runtime does not
improve.
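To make the intended effect concrete, the following is a minimal sketch of a bloom filter built from build-side join keys and applied to probe-side keys before they would be hashed across fragments. It is not Drill's implementation: the `BloomFilter` class, its sizing parameters, and the toy key ranges are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (hypothetical; not Drill's RuntimeFilter code)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Build side: a small set of join keys (stand-in for the filtered orders rows).
build_keys = list(range(0, 10_000, 100))
bloom = BloomFilter()
for key in build_keys:
    bloom.add(key)

# Probe side: many keys (stand-in for lineitem rows). Only rows that pass the
# filter would need to be hashed across fragments and probed.
probe_keys = list(range(10_000))
survivors = [key for key in probe_keys if bloom.might_contain(key)]

print(f"probe rows: {len(probe_keys)}, rows passing the filter: {len(survivors)}")
```

A bloom filter never drops a matching key, so the join result is unchanged; the benefit comes only from the rows it eliminates before the exchange.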

Join Query (TPCH dataset):

{code:sql}
select
l.l_orderkey
, sum(l.l_extendedprice * (1 - l.l_discount)) as revenue
, o.o_orderdate
, o.o_shippriority
from
orders o
, lineitem l
where
l.l_orderkey = o.o_orderkey
and o.o_orderdate = date '1994-08-26'
and MOD(o.o_custkey,10) = 1
group by
l.l_orderkey
, o.o_orderdate
, o.o_shippriority
order by
revenue desc
, o.o_orderdate limit 10;

{code}

The build side produces about 6K rows, and about 10M rows from the probe side are expected
to be joined.


The results of running this query in each join mode:

|| Join Mode || Profile || Runtime || Status ||
|BCastJoin w/o JPPD | 2477fa68-a31e-3b97-5469-373845c2b763 | 3.148sec | As expected. 600M rows are scanned and probed against the locally available hash table. |
|BCastJoin w/ JPPD | 2477fb99-36cb-9bc2-b7fb-c81a52b256d2 | 3.570sec | 04-xx-06 shows a reduction in rows. 600M rows are scanned, but only 10M rows are probed against the locally available hash table. |
|HashJoin w/o JPPD | 2477f5e8-fff2-fc83-d251-d8be8f92820b | 5.861sec | As expected. 600M rows are scanned and probed against the hash table. |
|HashJoin w/ JPPD | 2477f6f7-14e0-ca23-d9f7-6b0273c20964 | 8.376sec | 03-xx-07 is not seeing a reduction in rows. All 600M rows are scanned and probed against the hash table. |

There are a few possible reasons why the RuntimeFilter does not eliminate any rows when a
HashJoin is involved:
1. The RuntimeFilter operator does not have a bloom-filter.
2. The RuntimeFilter receives the bloom-filter after the scan completes, because the foreman
has not finished building and distributing the global bloom-filter.
3. The RuntimeFilter receives the bloom-filter during the scan, but does not apply it.
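The three possibilities above all produce the same symptom in a profile: the filter operator forwards every scanned row. The sketch below simulates this under stated assumptions. The function `rows_forwarded`, its parameters, and the use of a plain set in place of a bloom filter are hypothetical and do not reflect Drill's internals.

```python
def rows_forwarded(batches, build_keys, filter_arrives_at=None, apply_filter=True):
    """Count rows a simulated runtime-filter operator passes downstream.

    filter_arrives_at: index of the first batch for which the filter is
    available, or None if it never arrives (hypothetical parameter).
    apply_filter: whether the operator uses the filter once it has it.
    """
    forwarded = 0
    for batch_index, batch in enumerate(batches):
        have_filter = (
            filter_arrives_at is not None and batch_index >= filter_arrives_at
        )
        if have_filter and apply_filter:
            # A set stands in for the bloom filter in this sketch.
            forwarded += sum(1 for key in batch if key in build_keys)
        else:
            # No usable filter for this batch: every row passes through.
            forwarded += len(batch)
    return forwarded


# Ten probe-side batches of 100 keys each; three keys match the build side.
batches = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]
build_keys = {5, 105, 905}

never_received = rows_forwarded(batches, build_keys)                        # case 1
received_too_late = rows_forwarded(batches, build_keys, filter_arrives_at=10)  # case 2
received_not_applied = rows_forwarded(
    batches, build_keys, filter_arrives_at=0, apply_filter=False
)                                                                           # case 3
received_on_time = rows_forwarded(batches, build_keys, filter_arrives_at=0)

print(never_received, received_too_late, received_not_applied, received_on_time)
```

Because the forwarded row count is identical in all three failing cases, the row counts alone cannot tell them apart; distinguishing them would need information on whether and when the operator received the filter.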
            Reporter: Kunal Khatua
            Assignee: weijie.tong
         Attachments: bcastJoin-JPPD_2477fb99-36cb-9bc2-b7fb-c81a52b256d2.json, bcastJoin-default_2477fa68-a31e-3b97-5469-373845c2b763.json,
hashJoin-JPPD_2477f6f7-14e0-ca23-d9f7-6b0273c20964.json, hashJoin-default_2477f5e8-fff2-fc83-d251-d8be8f92820b.json





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)