hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <>
Subject [jira] [Assigned] (HIVE-20270) Don't serialize hashCode for groupByKey
Date Mon, 30 Jul 2018 14:59:00 GMT


Sahil Takiar reassigned HIVE-20270:

> Don't serialize hashCode for groupByKey
> ---------------------------------------
>                 Key: HIVE-20270
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
> Similar to HIVE-20032, but for {{groupByKey}}. The tricky part with {{groupByKey}} is that
we need to preserve the {{hashCode}} until the key has been partitioned (via the {{HashPartitioner}});
after that, the {{hashCode}} no longer needs to be preserved. The {{groupByKey}} operator
in Spark does require a {{hashCode}}, since it puts everything in a map, but it can use a different
hash code than the one specified in {{HiveKey}}. The hash code in {{HiveKey}} only matters
for determining which partition the key is assigned to.
> The drawback is that recomputing the hash code for each {{HiveKey}} might require more CPU
resources, so we should profile it to be sure.
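
A minimal sketch of the idea described above, not the actual Hive change. {{SketchKey}}, {{partitionFor}}, and {{roundTrip}} are hypothetical names invented for illustration, and Java serialization stands in for whatever serializer the shuffle really uses. The key carries a partitioning hash in a transient field, so the hash is available when the partition is chosen but is not written out; after deserialization, {{hashCode}} is recomputed from the key bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

class GroupByKeySketch {

    /** Hypothetical stand-in for HiveKey (not the real class). */
    public static final class SketchKey implements Serializable {
        private static final long serialVersionUID = 1L;

        private final byte[] bytes;
        // transient: Java serialization skips these fields, so they are not shuffled.
        private transient int partitionHash;
        private transient boolean hasPartitionHash;

        public SketchKey(byte[] bytes, int partitionHash) {
            this.bytes = bytes.clone();
            this.partitionHash = partitionHash;
            this.hasPartitionHash = true;
        }

        @Override
        public int hashCode() {
            // Before the shuffle: the preserved partitioning hash.
            // After deserialization: a hash recomputed from the bytes (extra CPU).
            return hasPartitionHash ? partitionHash : Arrays.hashCode(bytes);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof SketchKey && Arrays.equals(bytes, ((SketchKey) o).bytes);
        }
    }

    /** Non-negative modulo of hashCode, in the style of a hash partitioner. */
    public static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Simulates the shuffle by serializing and deserializing the key. */
    public static SketchKey roundTrip(SketchKey key) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(key);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                return (SketchKey) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchKey key = new SketchKey(new byte[] {1, 2, 3}, 42);
        System.out.println("partition before shuffle: " + partitionFor(key, 8));
        SketchKey received = roundTrip(key);
        System.out.println("hashCode after shuffle:   " + received.hashCode());
        System.out.println("still equal by bytes:     " + received.equals(key));
    }
}
```

In this sketch a key that still carries its partitioning hash and its deserialized copy are equal but may report different hash codes, so the two forms should not be mixed in the same map. That is acceptable only under the assumption that every key on the receiving side has already gone through the shuffle.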

This message was sent by Atlassian JIRA
