hive-issues mailing list archives

From "Eugene Koifman (Jira)" <j...@apache.org>
Subject [jira] [Assigned] (HIVE-21158) Perform update split early
Date Wed, 28 Jul 2021 14:14:04 GMT

     [ https://issues.apache.org/jira/browse/HIVE-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman reassigned HIVE-21158:
-------------------------------------

    Assignee:     (was: Eugene Koifman)

> Perform update split early
> --------------------------
>
>                 Key: HIVE-21158
>                 URL: https://issues.apache.org/jira/browse/HIVE-21158
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>    Affects Versions: 3.0.0
>            Reporter: Eugene Koifman
>            Priority: Major
>
> Currently Acid 2.0 does U=D+I in the OrcRecordUpdater. This means that all Updates (wide
rows) are shuffled AND sorted.
>  We could modify the multi-insert statement that results from a Merge statement so
that instead of having one of the legs represent Update, we create 2 legs - 1 representing
Delete of the original row and 1 representing Insert of the new version.
>  Delete events are very small so sorting them is cheap. The Inserts are written to disk
in a sorted way by virtue of how ROW__IDs are generated.
> Exactly the same idea applies to a regular Update statement.
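The split described above can be sketched in a few lines. This is a minimal illustration, not Hive code: `RowId`, `Update`, and `split_updates` are hypothetical names, and the three `RowId` fields are an assumption about what a ROW__ID carries. It only shows why the sort becomes cheap: the delete leg sorts small keys, while the wide new rows pass through unsorted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


# Hypothetical stand-in for ROW__ID; the field layout is an assumption.
@dataclass(frozen=True, order=True)
class RowId:
    write_id: int
    bucket: int
    row: int


# Hypothetical update event: which row version is replaced, and by what.
@dataclass(frozen=True)
class Update:
    row_id: RowId       # ROW__ID of the row version being replaced
    new_values: tuple   # the full (wide) new row


def split_updates(updates: Iterable[Update]) -> Tuple[List[RowId], List[tuple]]:
    """Split each update into a delete event and an insert of the new row."""
    updates = list(updates)
    # Delete leg: only the small ROW__ID keys are sorted.
    deletes = sorted(u.row_id for u in updates)
    # Insert leg: the wide rows are passed through in arrival order.
    inserts = [u.new_values for u in updates]
    return deletes, inserts
```

For `update T set B = 7 where A=1`, each matching row would contribute one `RowId` to the delete leg and one `(a, 7)` tuple to the insert leg.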
> Note that the U=D+I in OrcRecordUpdater needs to be kept to keep [Streaming Mutate API
|https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API] working
on 2.0.
> *This requires that TxnHandler flags 2 Deletes as a conflict - it doesn't currently*
> Incidentally, 2.0 + early split allows updating all columns including bucketing and partition
columns.
> What is lock acquisition based on? Need to make sure that conflict detection (write set
tracking) still works.
> So we want to transform
> {noformat}
> update T set B = 7 where A=1
> {noformat}
> into
> {noformat}
> from T
> insert into T select ROW__ID where a = 1 SORT BY ROW__ID
> insert into T select a, 7 where a = 1
> {noformat}
> even better to
> {noformat}
> from T where a = 1
> insert into T select ROW__ID SORT BY ROW__ID
> insert into T select a, 7
> {noformat}
> but this won't parse currently.
> This is very similar to how MERGE stmt is handled.
> Need some thought on how WriteSet tracking works. If we don't allow updating the partition
column, then even with dynamic partitions TxnHandler.addDynamicPartitions() should see 1 entry
(of Update type) for each partition since both the insert and delete land in the same partition.
If part cols can be updated, then we may insert a Delete event into P1 and the corresponding
Insert event into P2 so addDynamicPartitions() should see both parts. I guess both need to
be recorded in Write_Set but with different types: the delete as 'delete' and the insert as 'insert',
so that it can conflict with some IOW on the 'new' partition.
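The write-set bookkeeping discussed above might look roughly like the following. This is a sketch under stated assumptions, not TxnHandler's actual logic: `WriteSetEntry`, `entries_for_update`, and `conflicts` are hypothetical names, IOW is read here as insert overwrite, and the conflict table encodes only the two cases named in the description (two deletes, and an insert against an IOW). It also ignores transaction ordering and commit times, which real conflict detection would have to consider.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical write-set record; not the actual Write_Set schema.
@dataclass(frozen=True)
class WriteSetEntry:
    txn_id: int
    partition: str
    op: str  # 'delete', 'insert', or 'insert_overwrite'


# ASSUMED conflict rules, taken only from the two cases the issue mentions.
CONFLICTING = {
    frozenset({"delete"}),                      # two deletes on one partition
    frozenset({"insert", "insert_overwrite"}),  # insert vs. IOW on one partition
}


def entries_for_update(txn_id: int, old_partition: str,
                       new_partition: str) -> List[WriteSetEntry]:
    """Record an early-split update as a delete plus an insert."""
    return [
        WriteSetEntry(txn_id, old_partition, "delete"),
        WriteSetEntry(txn_id, new_partition, "insert"),
    ]


def conflicts(a: WriteSetEntry, b: WriteSetEntry) -> bool:
    """True if two entries from different transactions clash on a partition."""
    return (
        a.txn_id != b.txn_id
        and a.partition == b.partition
        and frozenset({a.op, b.op}) in CONFLICTING
    )
```

With these rules, two transactions that both update rows out of P1 clash on their delete entries, and an update that moves a row into P2 clashes with an IOW on P2, while two plain inserts into the same partition do not.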



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
