hive-issues mailing list archives

From "Abhishek Somani (JIRA)" <>
Subject [jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables
Date Tue, 19 Mar 2019 10:13:00 GMT


Abhishek Somani commented on HIVE-13479:

What I meant is: now that ACID v2 has been implemented, do we plan to work on relaxing the
sorting requirement? As far as I know, we still require rows to be sorted on the ACID
columns (row ID), so that the reader can sort-merge the delete events with the insert
events while reading. Is that right?
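
To make the sort-merge read concrete, here is a minimal sketch, not Hive's actual reader. It assumes a row ID can be modelled as a (write id, bucket, row number) tuple and that both event streams arrive sorted ascending on it; the function and variable names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Simplified stand-in for the ACID columns (an assumption for illustration).
RowId = Tuple[int, int, int]  # (write id, bucket, row number)


def sort_merge_read(inserts: Iterable[Tuple[RowId, str]],
                    deletes: Iterable[RowId]) -> Iterator[Tuple[RowId, str]]:
    """Yield insert events that have no matching delete event.

    Both inputs must already be sorted ascending by row id; that is the
    sorting requirement this kind of reader depends on.
    """
    delete_iter = iter(deletes)
    current_delete = next(delete_iter, None)
    for row_id, payload in inserts:
        # Advance the delete cursor until it is at or past this row id.
        while current_delete is not None and current_delete < row_id:
            current_delete = next(delete_iter, None)
        if current_delete == row_id:
            continue  # a delete event matches, so the row is not visible
        yield row_id, payload


inserts = [((1, 0, 0), "a"), ((1, 0, 1), "b"), ((2, 0, 0), "c")]
deletes = [(1, 0, 1)]
snapshot = list(sort_merge_read(inserts, deletes))
```

The merge keeps only one delete event in memory at a time, which is the benefit bought by keeping everything sorted on the row ID.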

If so, the only way to have data sorted on another, user-specified column seems to be to
insert the data already ordered on that column, so that it is sorted on BOTH the ACID
columns and the user-specified column.

If, however, we were able to relax the requirement that data HAS to be sorted on the ACID
columns, we could use something like compaction to sort the data on user-desired columns in
the background. Theoretically one could do such sorting in compaction even today. But unless
the sorting requirement is relaxed, the data would need to be sorted on both the row IDs and
the user column. Compaction would then have to behave like an insert overwrite and generate
new row IDs, so that the data is sorted on both the (new) row ID columns and the
user-specified column. That would be good to avoid.
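
A small sketch of what "behave like an insert overwrite" would mean in practice. This is purely illustrative and is not how Hive compaction works: `rewrite_sorted`, the tuple-shaped row ID, and the `city` column are all hypothetical.

```python
from typing import Callable, List, Tuple

RowId = Tuple[int, int, int]  # (write id, bucket, row number), simplified
Row = Tuple[RowId, dict]


def rewrite_sorted(rows: List[Row],
                   user_key: Callable[[dict], object],
                   new_write_id: int,
                   bucket: int = 0) -> List[Row]:
    """Sort rows on a user column and assign fresh, increasing row ids.

    Discarding the old row ids is what keeps the output sorted on both
    the user column and the (new) row id at the same time.
    """
    ordered = sorted(rows, key=lambda row: user_key(row[1]))
    return [((new_write_id, bucket, n), payload)
            for n, (_, payload) in enumerate(ordered)]


rows = [
    ((1, 0, 0), {"city": "Pune"}),
    ((1, 0, 1), {"city": "Agra"}),
    ((2, 0, 0), {"city": "Delhi"}),
]
rewritten = rewrite_sorted(rows, lambda p: p["city"], new_write_id=3)
```

The cost is visible here: every row gets a new identity, so anything that referred to the old row IDs no longer matches.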

Have I understood this correctly?

> Relax sorting requirement in ACID tables
> ----------------------------------------
>                 Key: HIVE-13479
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>   Original Estimate: 160h
>  Remaining Estimate: 160h
> Currently ACID tables require data to be sorted according to internal primary key.  This
> is so that base + delta files can be efficiently sort/merged to produce the snapshot for current
> This prevents the user from sorting the table on any other criteria, which can
> be useful.  One example is using dynamic partition insert (which also occurs for update/delete
> SQL).  This may create lots of writers (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which won't be honored
> for ACID tables.
> We could rely on a hash table based algorithm to merge delta files and then not require
> any particular sort on ACID tables.  One way to do that is to treat each update event as an
> Insert (new internal PK) + delete (old PK).  Delete events are very small since they just
> need to contain PKs.  So the hash table would just need to contain Delete events and be reasonably
> memory efficient.
> This is a significant amount of work but worth doing.
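
The hash table based merge proposed in the description can be sketched as follows. This is a minimal illustration under assumptions, not an existing Hive implementation: primary keys are modelled as (write id, bucket, row number) tuples, and `hash_merge_read` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

RowId = Tuple[int, int, int]  # simplified internal PK (an assumption)


def hash_merge_read(inserts: Iterable[Tuple[RowId, str]],
                    deletes: Iterable[RowId]) -> List[Tuple[RowId, str]]:
    """Return insert events whose PK has no delete event.

    Only the delete PKs are held in memory, and the insert events may
    arrive in any order, so no sort on the PK is needed.
    """
    deleted = set(deletes)
    return [(pk, payload) for pk, payload in inserts if pk not in deleted]


# An update of row (1, 0, 1) from "b" to "b2" is modelled as
# delete(old PK) + insert(new PK), as the description suggests.
inserts = [
    ((2, 0, 0), "c"),
    ((1, 0, 1), "b"),
    ((3, 0, 0), "b2"),  # new version written by the update
    ((1, 0, 0), "a"),
]
deletes = [(1, 0, 1)]  # old version removed by the update
snapshot = hash_merge_read(inserts, deletes)
```

Because the filter is a set lookup per row, the insert events keep whatever order they were stored in, which is what would leave room for a user-chosen sort order.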
