hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data
Date Sat, 09 Feb 2019 00:30:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-20901:
----------------------------------
    Description: 
Suppose we run minor compaction twice, via ALTER TABLE.

The second compaction request should have nothing to do, but I don't think there is a check
for that.  It's visible in the context of HIVE-20823, where each compactor run produces
a delta with a new visibility suffix, so we end up with something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_0000001_0000002_v0000019
│   ├── _orc_acid_version
│   └── bucket_00000
├── delete_delta_0000001_0000002_v0000021
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000001_0000
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000002_v0000019
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000002_v0000021
│   ├── _orc_acid_version
│   └── bucket_00000
└── delta_0000002_0000002_0000
    ├── _orc_acid_version
    └── bucket_00000{noformat}
i.e. 2 deltas (and 2 delete deltas) with the same write ID range
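
The repeated ranges in the listing above can be spotted mechanically. The following is a minimal sketch, assuming directory names follow the patterns visible in the listing (`delta_<min>_<max>_v<txn>` for compactor output, `delta_<min>_<max>_<stmt>` otherwise). The regular expression and the helpers `parse_delta` and `duplicate_ranges` are hypothetical illustrations, not Hive code.

```python
import re
from collections import defaultdict

# Pattern inferred from the directory listing in this issue (an assumption,
# not Hive's own parser): optional "delete_" prefix, min and max write ID,
# then either a visibility suffix ("v" + id) or a statement id.
DELTA_RE = re.compile(
    r"^(?P<kind>delete_delta|delta)_(?P<min>\d+)_(?P<max>\d+)"
    r"(?:_v(?P<visibility>\d+)|_(?P<stmt>\d+))?$"
)

def parse_delta(name):
    """Return the parsed parts of a delta directory name, or None."""
    m = DELTA_RE.match(name)
    if m is None:
        return None
    vis = m.group("visibility")
    return {
        "name": name,
        "kind": m.group("kind"),
        "min": int(m.group("min")),
        "max": int(m.group("max")),
        "visibility": int(vis) if vis is not None else None,
    }

def duplicate_ranges(names):
    """Group directories by (kind, min, max); keep groups with more than one."""
    groups = defaultdict(list)
    for name in names:
        d = parse_delta(name)
        if d is not None:
            groups[(d["kind"], d["min"], d["max"])].append(name)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

# Directory names copied from the listing in this issue.
LISTING = [
    "delete_delta_0000001_0000002_v0000019",
    "delete_delta_0000001_0000002_v0000021",
    "delta_0000001_0000001_0000",
    "delta_0000001_0000002_v0000019",
    "delta_0000001_0000002_v0000021",
    "delta_0000002_0000002_0000",
]

for key, dirs in sorted(duplicate_ranges(LISTING).items()):
    print(key, dirs)
```

Run against the listing, this reports the delta and delete_delta pairs that cover write IDs 1 to 2, which are the directories left behind by the two compactor runs.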

This is bad.  It probably happens today as well, but without the visibility suffix the new run
produces a delta with the same name and clobbers the previous one, which may interfere with writers.

Need to investigate.

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both deltas as if they
were distinct and it effectively duplicates data.-  Correction: there is no data duplication.
{{getAcidState()}} will not use 2 deltas with the same {{writeid}} range.
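
To illustrate why a reader that keeps a single delta per write ID range sees no duplicate rows, here is a small sketch. How {{getAcidState()}} actually chooses between such deltas is not described in this issue; the rule used below (keep the directory with the highest visibility suffix) is purely an assumption for illustration, and `one_delta_per_range` is a hypothetical helper, not a Hive API.

```python
import re

# Name pattern inferred from the listing in this issue (an assumption).
DELTA_RE = re.compile(
    r"^(delete_delta|delta)_(\d+)_(\d+)(?:_v(\d+)|_(\d+))?$"
)

def one_delta_per_range(names):
    """Keep one directory per (kind, min, max, statement id).

    Assumed tie-break, for illustration only: the highest visibility
    suffix wins. This is not claimed to be what AcidUtils does.
    """
    chosen = {}
    for name in names:
        m = DELTA_RE.match(name)
        if m is None:
            continue  # not a delta directory
        kind, lo, hi, vis, stmt = m.groups()
        key = (kind, int(lo), int(hi), stmt)
        visibility = int(vis) if vis is not None else -1
        if key not in chosen or visibility > chosen[key][0]:
            chosen[key] = (visibility, name)
    return sorted(name for _, name in chosen.values())

# Directory names copied from the listing in this issue.
LISTING = [
    "delete_delta_0000001_0000002_v0000019",
    "delete_delta_0000001_0000002_v0000021",
    "delta_0000001_0000001_0000",
    "delta_0000001_0000002_v0000019",
    "delta_0000001_0000002_v0000021",
    "delta_0000002_0000002_0000",
]

for name in one_delta_per_range(LISTING):
    print(name)
```

The sketch only removes same-range duplicates. It does not model the separate question of whether the compacted 1 to 2 delta makes the single-write-ID deltas obsolete.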

 

 

  was:
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there is a check
for that.  It's visible in the context of HIVE-20823, where each compactor run produces
a delta with new visibility suffix so we end up with something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_0000001_0000002_v0000019
│   ├── _orc_acid_version
│   └── bucket_00000
├── delete_delta_0000001_0000002_v0000021
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000001_0000
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000002_v0000019
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000002_v0000021
│   ├── _orc_acid_version
│   └── bucket_00000
└── delta_0000002_0000002_0000
    ├── _orc_acid_version
    └── bucket_00000{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with the same name
and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both deltas as if they
were distinct and it effectively duplicates data.-  There is no data duplication - {{getAcidState()}}
will use 2 deltas with the same \{{writeid}} range

 

 


> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
>                 Key: HIVE-20901
>                 URL: https://issues.apache.org/jira/browse/HIVE-20901
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 4.0.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
