hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maciek Kocon (JIRA)" <>
Subject [jira] [Commented] (HIVE-12334) Partition Map Join
Date Wed, 04 Nov 2015 21:31:27 GMT


Maciek Kocon commented on HIVE-12334:

Interesting. Haven't thought of such expansion but yea it would be logically equivalent, I
As per performance it could be probably done lot more efficient if implemented natively rather
than relying on stitching multiple (hundreds?) UNION ALL queries…

BTW. I assume you wanted "20151209" there on both sides

> Partition Map Join
> ------------------
>                 Key: HIVE-12334
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer, SQL
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
>            Reporter: Maciek Kocon
>              Labels: gsoc2015
> Logically and functionally bucketing and partitioning are quite similar - both provide
mechanism to segregate and separate the table's data based on its content. Thanks to that
significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design where the PARTITIONing is open/explicit
while BUCKETing is discrete/implicit.
> Partitioning seems to be very common if not a standard feature in all current RDBMS while
BUCKETING seems to be HIVE specific only.
> In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".
> Regardless of the fact that these two are recognised as two separate features available
in Hive there should be nothing to prevent leveraging same existing query/join optimisations
across the two.
> Use the same type of optimization as in BUCKETED MAP JOIN for PARTITIONED tables.
> The partition map join could be performed if the tables being joined are partitioned
on the join columns.
> If table A has set partitioning on KEY column and table B is partitioned on KEY column,
the following join
> SELECT /*+ MAPJOIN(b) */ a.key, a.value
> FROM a JOIN b ON a.key = b.key
> can be done on the mapper only. Instead of fetching B completely for each mapper of A,
only the required partitions are fetched. For the query above, the mapper processing partition
key='20151208' for A will only fetch partition for key='20151208' of B.

This message was sent by Atlassian JIRA

View raw message