hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <>
Subject [jira] [Resolved] (HIVE-17206) make a version of Compactor specific to unbucketed tables
Date Sat, 15 Dec 2018 00:59:00 GMT


Eugene Koifman resolved HIVE-17206.
    Resolution: Won't Fix

This would mean that either we have to change ROW__IDs during compaction, which we cannot do unless compaction is made to run under an X lock, or we break the relationship between the bucket_N file name and the ROW__ID.bucketid property of the rows in that file. Breaking that relationship would mean all delete events have to be localized at each task, rather than just those in the matching delete_delta/bucket_N.
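The invariant described above can be sketched as follows. This is an illustrative example, not Hive's actual Compactor code; the class and method names are hypothetical. While the bucket_N / ROW__ID.bucketid relationship holds, a task compacting bucket_N can ignore every delete file except the one whose name encodes the same N:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: prune delete-delta files down to the one bucket
// a compaction task is responsible for. If the bucket_N filename no
// longer matched ROW__ID.bucketid, every task would instead have to
// read ALL delete files, which is the cost the resolution comment notes.
public class DeleteDeltaPruning {
    // Keep only delete files whose bucket suffix matches this task's bucket.
    static List<String> relevantDeleteFiles(List<String> deleteFiles, int taskBucket) {
        String suffix = String.format("bucket_%05d", taskBucket);
        return deleteFiles.stream()
                .filter(f -> f.endsWith(suffix))
                .collect(Collectors.toList());
    }
}
```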

> make a version of Compactor specific to unbucketed tables
> ---------------------------------------------------------
>                 Key: HIVE-17206
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
> The current Compactor will work but is not optimized/flexible enough.
> The current compactor is designed to generate a number of splits equal to the number
> of buckets in the table.  That is the degree of parallelism.
> For unbucketed tables, the same scheme is used, but the "number of buckets" is derived from
> the files found in the deltas.  For small writes, there will likely be just one bucket_00000
> file.  For large writes, the parallelism of the write determines the number of output files.
> Need to make sure the Compactor can control parallelism for unbucketed tables as it wishes.
> For example, hash partition all records (by ROW__ID?) into N disjoint sets.
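The hash-partitioning idea in the quoted description could be sketched as below. This is an assumption-laden illustration, not Hive code: `RowId` is a hypothetical stand-in for the ACID (writeId, bucketId, rowId) triple behind ROW__ID, and `partition` simply buckets rows into N disjoint sets so the compactor could choose N independently of the file layout.

```java
import java.util.*;

// Sketch only (not Hive's Compactor): hash-partition rows by their
// ROW__ID into N disjoint sets, so the degree of compaction parallelism
// is chosen freely rather than dictated by the files found in the deltas.
public class RowIdPartitioner {
    // Hypothetical stand-in for the ACID ROW__ID triple.
    static final class RowId {
        final long writeId;
        final int bucketId;
        final long rowId;
        RowId(long writeId, int bucketId, long rowId) {
            this.writeId = writeId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }
        @Override public int hashCode() {
            return Objects.hash(writeId, bucketId, rowId);
        }
    }

    // Map a ROW__ID to exactly one of n disjoint partitions.
    static int partition(RowId id, int n) {
        // floorMod keeps the result in [0, n) even for negative hashes.
        return Math.floorMod(id.hashCode(), n);
    }
}
```

Because the mapping is deterministic, every row (and every delete event carrying the same ROW__ID) lands in the same partition, which is what makes the N sets disjoint and independently compactable.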

This message was sent by Atlassian JIRA
