hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HIVE-20270) Don't serialize hashCode for groupByKey
Date Mon, 30 Jul 2018 14:59:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sahil Takiar reassigned HIVE-20270:
-----------------------------------


> Don't serialize hashCode for groupByKey
> ---------------------------------------
>
>                 Key: HIVE-20270
>                 URL: https://issues.apache.org/jira/browse/HIVE-20270
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> Similar to HIVE-20032, but for {{groupByKey}}. The tricky part with {{groupByKey}} is that
> we need to preserve the {{hashCode}} until the key has been partitioned (via the
> {{HashPartitioner}}); after that, we no longer need it. The {{groupByKey}} operator in Spark
> does require a {{hashCode}}, since it puts everything in a map, but it can use a different
> hash code than the one specified in {{HiveKey}}. The hash code in {{HiveKey}} only matters
> for determining the partition the key is assigned to.
> The drawback is that computing the hash code for each {{HiveKey}} might require more CPU
> resources, so we should profile it to check.
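
The idea in the description can be sketched in plain Java without Spark or Hive on the classpath. This is only an illustration under stated assumptions, not the patch for this issue: {{SketchKey}}, {{partitionFor}}, {{serialize}} and {{deserialize}} are hypothetical names, and the fallback hash over the key bytes is an arbitrary choice. The key carries an assigned hash that is used to pick the partition, the wire format omits that hash, and the deserialized key recomputes a hash from its bytes so it can still be placed in a map.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class GroupKeySketch {

    /** Hypothetical stand-in for HiveKey: key bytes plus an externally assigned hash. */
    static final class SketchKey {
        final byte[] bytes;
        private final int partitionHash;        // only meaningful before the shuffle
        private final boolean hasPartitionHash;

        /** Map-side key: the hash is assigned by the caller and drives partitioning. */
        SketchKey(byte[] bytes, int partitionHash) {
            this.bytes = bytes;
            this.partitionHash = partitionHash;
            this.hasPartitionHash = true;
        }

        /** Reduce-side key: the assigned hash was never sent over the wire. */
        private SketchKey(byte[] bytes) {
            this.bytes = bytes;
            this.partitionHash = 0;
            this.hasPartitionHash = false;
        }

        // Before the shuffle, return the assigned hash. After it, derive a hash
        // from the bytes (assumption: any hash consistent with equals() will do).
        // This recomputation is the extra CPU cost the description mentions.
        @Override
        public int hashCode() {
            return hasPartitionHash ? partitionHash : Arrays.hashCode(bytes);
        }

        // Equality looks at the bytes only. Map-side and reduce-side keys are
        // assumed never to share one map, since their hash codes can differ.
        @Override
        public boolean equals(Object o) {
            return o instanceof SketchKey && Arrays.equals(bytes, ((SketchKey) o).bytes);
        }
    }

    /** Non-negative modulo of the key's hash, the rule a hash partitioner applies. */
    static int partitionFor(SketchKey key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Writes the length and the bytes; the assigned hash is deliberately left out. */
    static byte[] serialize(SketchKey key) {
        return ByteBuffer.allocate(4 + key.bytes.length)
                .putInt(key.bytes.length)
                .put(key.bytes)
                .array();
    }

    /** Rebuilds a reduce-side key from the wire format produced by serialize(). */
    static SketchKey deserialize(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new SketchKey(bytes);
    }

    public static void main(String[] args) {
        SketchKey mapSide = new SketchKey(new byte[] {1, 2, 3}, 42);
        System.out.println("partition = " + partitionFor(mapSide, 4));

        byte[] wire = serialize(mapSide);
        System.out.println("bytes on the wire = " + wire.length);

        // Grouping on the reduce side still works with the recomputed hash.
        Map<SketchKey, Integer> counts = new HashMap<>();
        counts.merge(deserialize(wire), 1, Integer::sum);
        counts.merge(deserialize(wire), 1, Integer::sum);
        System.out.println("groups = " + counts.size());
    }
}
```

Dropping the hash saves four bytes per key in this sketch; whether that outweighs the cost of recomputing a hash for every deserialized key is the profiling question raised above.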



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
