hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20032) Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
Date Fri, 13 Jul 2018 02:00:02 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542413#comment-16542413 ]

Sahil Takiar commented on HIVE-20032:
-------------------------------------

As for benchmarking, I have run a lot of TPC-DS benchmarks, and I don't consistently get
better performance. However, the amount of shuffled data is significantly reduced (as is
the amount of data spilled to disk). My guess is that latency doesn't improve much because
I'm running my tests on an unloaded cluster. However, I expect cluster throughput to be better
with this patch since fewer I/O resources are being used. I'll need to run some concurrent
TPC-DS workloads to confirm this, though.

> Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
> -------------------------------------------------------------------------
>
>                 Key: HIVE-20032
>                 URL: https://issues.apache.org/jira/browse/HIVE-20032
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-20032.1.patch, HIVE-20032.2.patch, HIVE-20032.3.patch, HIVE-20032.4.patch
>
>
> Follow-up to HIVE-15104: if we don't enable RDD caching or groupByShuffles, then we
> don't need to serialize the hashCode when shuffling data in HoS.
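
To illustrate the idea in the description, here is a minimal sketch of a shuffle key whose
precomputed hash code is written only when a flag asks for it. This is not the code from the
attached patches: the class ShuffleKeySketch, the nested Key type, and the includeHashCode
flag are all hypothetical names, and the length-prefixed layout is an assumption chosen only
to make the size difference visible.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ShuffleKeySketch {

    /** Hypothetical stand-in for a shuffle key: raw key bytes plus a precomputed hash code. */
    static final class Key {
        final byte[] bytes;
        final int hashCode;

        Key(byte[] bytes) {
            this.bytes = bytes;
            this.hashCode = Arrays.hashCode(bytes);
        }
    }

    /**
     * Writes the key as a 4-byte length followed by the key bytes. The 4-byte hash code
     * is appended only when includeHashCode is true (an assumed flag that would be set
     * when something downstream, such as a cached RDD, needs the hash code back).
     */
    static byte[] serialize(Key key, boolean includeHashCode) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(key.bytes.length);
            out.write(key.bytes);
            if (includeHashCode) {
                out.writeInt(key.hashCode);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Key key = new Key("store_sales".getBytes(StandardCharsets.UTF_8));
        System.out.println("with hashCode:    " + serialize(key, true).length + " bytes");
        System.out.println("without hashCode: " + serialize(key, false).length + " bytes");
    }
}
```

In this sketch, skipping the hash code saves four bytes per key, which is a large fraction
for short keys. The saving in a real HoS shuffle depends on the actual key and value layout.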



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
