hive-issues mailing list archives

From "Mustafa Iman (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes
Date Fri, 02 Oct 2020 23:38:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206559#comment-17206559 ]

Mustafa Iman commented on HIVE-24205:
-------------------------------------

I added a simple max/min length check in CuckooSetBytes#lookup. Attached file shows some benchmark
results.
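
For readers without the patch at hand, here is a minimal sketch of the kind of length check described above. It is not the code from the patch: the class LengthBoundedByteSet and its hashedLookups counter are hypothetical names, and a java.util.HashSet stands in for the real cuckoo hash tables. Only the idea is carried over: track the shortest and longest inserted value, and return early from lookup when the probe length falls outside that range.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a byte-string set; NOT the actual CuckooSetBytes code. */
final class LengthBoundedByteSet {
    // Assumption: a HashSet replaces the cuckoo tables to keep the sketch short.
    private final Set<ByteBuffer> members = new HashSet<>();
    private int minLen = Integer.MAX_VALUE;
    private int maxLen = 0;
    // Counts lookups that got past the length check; for illustration only.
    private long hashedLookups = 0;

    void insert(byte[] value) {
        members.add(ByteBuffer.wrap(value.clone()));
        minLen = Math.min(minLen, value.length);
        maxLen = Math.max(maxLen, value.length);
    }

    boolean lookup(byte[] bytes, int start, int len) {
        // Early return: no member has this length, so hashing is skipped.
        if (len < minLen || len > maxLen) {
            return false;
        }
        hashedLookups++;
        // ByteBuffer equality and hashing cover only the wrapped slice.
        return members.contains(ByteBuffer.wrap(bytes, start, len));
    }

    long hashedLookups() {
        return hashedLookups;
    }
}
```

The check costs two integer comparisons per probe, which is consistent with the low overhead reported further down; how much it saves depends on how many probe values fall outside the length range.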

 

*TPCH_Q12* is a select with an IN clause followed by a join. The selectivity of the filter is
30%.

*Synthetic* query is a simple select with an IN clause. The IN list is over three of the longest
comment values (all 72 characters wide), so the filter is very selective, at about 2%:

select o_orderkey, o_comment from orders where o_comment in (
  'jole quickly furiously bold escapades: regular accounts play regular req',
  's foxes. regular warhorses detect fluffily. carefully regular tithes amo',
  'grate ironic, pending sauternes. deposits do are slyly. carefully ironic');

*Synthetic Wide* query is the same as Synthetic, except that the IN clause is over one
shortest-length and one longest-length comment. The filter is still very selective at 4%, but
because the min/max length bounds then cover every comment length, our optimization cannot
eliminate any tuples.

select o_orderkey, o_comment from orders where o_comment in (
  'jole quickly furiously bold escapades: regular accounts play regular req',
  'ts nag furiously. even');
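
To make the difference between the two synthetic queries concrete, the sketch below counts how many rows a min/max length check could reject without hashing, given the IN-list literals and the lengths of the column values. LengthPruning and countPrunable are hypothetical names used only for this illustration, and the row lengths in the usage note are made up rather than taken from the benchmark.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Hive or the patch. */
final class LengthPruning {
    /** Counts row lengths that fall outside [min, max] of the IN-list literal lengths. */
    static int countPrunable(int[] rowLengths, String[] inList) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String literal : inList) {
            int len = literal.getBytes(StandardCharsets.UTF_8).length;
            min = Math.min(min, len);
            max = Math.max(max, len);
        }
        int prunable = 0;
        for (int len : rowLengths) {
            // Same test as the early return in lookup.
            if (len < min || len > max) {
                prunable++;
            }
        }
        return prunable;
    }
}
```

With made-up row lengths {3, 5, 8}, an IN list whose literals are all 8 bytes long lets the check reject the two shorter rows, while an IN list holding one 3-byte and one 8-byte literal widens the bounds to [3, 8] and rejects none, which is the Synthetic Wide situation.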

 

The patch outperforms the original code by 50% on the synthetic query. For TPCH Q12, there is no
meaningful difference between the two runs. My conclusion is that the optimization has very low
overhead and gives a significant performance improvement in certain cases.

I also implemented a vectorized version of the early return from CuckooSetBytes. It is attached as
vectorized.patch. However, in all cases the simpler patch outperforms the vectorized one.

> Optimise CuckooSetBytes
> -----------------------
>
>                 Key: HIVE-24205
>                 URL: https://issues.apache.org/jira/browse/HIVE-24205
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Mustafa Iman
>            Priority: Major
>         Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, vectorized.patch
>
>
> {{FilterStringColumnInList, StringColumnInList}} etc. use CuckooSetBytes for lookup.
> !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508!
> One option to optimize would be to add boundary conditions on "length" with the min/max
> length stored in the hashes (ref: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]).
> This would significantly reduce the number of hash computations that need to happen. E.g.
> [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
