hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <>
Subject [jira] [Commented] (HIVE-16177) non Acid to acid conversion doesn't handle _copy_N files
Date Sat, 17 Jun 2017 16:38:02 GMT


Eugene Koifman commented on HIVE-16177:

The file list is sorted to make sure there is consistent ordering for both read and compact.
Compaction needs to process the whole list of files (for a bucket) and assign ROW_IDs consistently.
For read, OrcRawRecordReader just has a split from some file, so I need to make sure to order
them the same way, so that the "offset" for the current file is computed the same way as for
compaction.

Since Hive doesn't restrict the layout of files in a table very well, sorting is the most
general way to do this.
For example, say we realize that some "feature" places bucket files in subdirectories - by
sorting the whole list of "original" files it makes this work with any directory layout.
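
To illustrate the idea (this is only a rough sketch, not the code in the patch): if both the
reader and the compactor sort the same list of "original" files for a bucket with the same
Comparator, each file's starting row id is the total row count of the files sorted before it.
The class OriginalFileOffsets, the OriginalFile holder, and the plain path ordering are all
hypothetical names and choices made for this example; row counts are assumed to be known.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OriginalFileOffsets {

    /** Hypothetical stand-in for one pre-ACID ("original") file of a bucket. */
    public static final class OriginalFile {
        public final String path;
        public final long rowCount;

        public OriginalFile(String path, long rowCount) {
            this.path = path;
            this.rowCount = rowCount;
        }
    }

    /**
     * Assumed ordering: plain lexicographic order on the full path.
     * Any total order works, as long as read and compaction share it.
     */
    public static final Comparator<OriginalFile> BY_PATH =
        Comparator.comparing((OriginalFile f) -> f.path);

    /**
     * Returns, for each file, the row id of its first row: the sum of the
     * row counts of all files that sort before it.
     */
    public static Map<String, Long> rowIdOffsets(List<OriginalFile> files) {
        List<OriginalFile> sorted = new ArrayList<>(files); // do not mutate the input
        sorted.sort(BY_PATH);

        Map<String, Long> offsets = new LinkedHashMap<>();
        long nextRowId = 0;
        for (OriginalFile f : sorted) {
            offsets.put(f.path, nextRowId);
            nextRowId += f.rowCount;
        }
        return offsets;
    }
}
```

Because the offsets depend only on the sorted order, a reader that holds a split from a single
file can list the bucket's files, sort them, and arrive at the same offset the compactor would
use, whatever order the file system returned them in and whatever the directory layout is.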

Putting a Comparator in AcidUtils makes sense.

"totalSize" is probably because I run the tests on Mac.  Stats often differ on Mac.

> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>                 Key: HIVE-16177
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 0.14.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Blocker
>         Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch, HIVE-16177.04.patch, HIVE-16177.07.patch,
HIVE-16177.08.patch, HIVE-16177.09.patch, HIVE-16177.10.patch, HIVE-16177.11.patch, HIVE-16177.14.patch,
> {noformat}
> create table T(a int, b int) clustered by (a)  into 2 buckets stored as orc TBLPROPERTIES('transactional'='false')
> insert into T(a,b) values(1,2)
> insert into T(a,b) values(1,3)
> alter table T SET TBLPROPERTIES ('transactional'='true')
> {noformat}
>     //we should now have bucket files 000001_0 and 000001_0_copy_1
> but doesn't know that there can be copy_N files and numbers rows in each bucket from 0,
> thus generating duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> The attached patch has a few changes to make Acid even recognize copy_N, but this is just
> a prerequisite.  The new UT demonstrates the issue.
> Furthermore,
> {noformat}
> alter table T compact 'major'
> select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0}	file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001
1	2
> {noformat}
> HIVE-16177.04.patch has TestTxnCommands.testNonAcidToAcidConversion0() demonstrating this.
> This is because the compactor doesn't handle copy_N files either (it skips them).

