Date: Tue, 4 Aug 2020 10:52:00 +0000 (UTC)
From: "Marta Kuczora (Jira)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Resolved] (HIVE-23763) Query based minor compaction produces wrong files when rows with different
 buckets Ids are processed by the same FileSinkOperator

     [ https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marta Kuczora resolved HIVE-23763.
----------------------------------
    Resolution: Fixed

> Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23763
>                 URL: https://issues.apache.org/jira/browse/HIVE-23763
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>   Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table.
> - Insert enough data into this table that it ends up with multiple bucket files.
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00000_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00001_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00002_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00003_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00004_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00005_0
> - Delete some rows that have different bucket Ids.
> The files in the delete deltas should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000002_0000002_0000/bucket_00000
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00003
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00001
> - Run the query-based minor compaction.
> - After the compaction, the newly created delete delta contains only 1 bucket file. This file contains rows from all buckets, and the table becomes unusable:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000001_0000007_v0000066/bucket_00000
> The issue happens only if rows with different bucket Ids are processed by the same FileSinkOperator.
> In the FileSinkOperator.process method, the files for the compaction table are created like this:
> {noformat}
>     if (!bDynParts && !filesCreated) {
>       if (lbDirName != null) {
>         if (valToPaths.get(lbDirName) == null) {
>           createNewPaths(null, lbDirName);
>         }
>       } else {
>         if (conf.isCompactionTable()) {
>           int bucketProperty = getBucketProperty(row);
>           bucketId = BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
>         }
>         createBucketFiles(fsp);
>       }
>     }
> {noformat}
> When the first row is processed, the file is created and the filesCreated variable is set to true. When the following rows are processed, the condition of the first if statement is false, so no new file is created, even for a row with a different bucket Id. Every such row is written into the file that was created for the first row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
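
The behavior described in the issue can be illustrated with a small standalone sketch. It does not use any Hive classes and is not the fix that was committed for HIVE-23763. The class BucketRoutingSketch, the Row type and both methods are hypothetical names introduced only for illustration, and each "file" is modeled as an in-memory list keyed by bucket Id. The first method mimics the filesCreated guard quoted above; the second shows one possible alternative, which looks up the output by each row's own bucket Id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketRoutingSketch {

    /** Hypothetical stand-in for a row: only the bucket Id and a payload. */
    public static final class Row {
        final int bucketId;
        final String value;

        public Row(int bucketId, String value) {
            this.bucketId = bucketId;
            this.value = value;
        }
    }

    /** Mimics the reported behavior: the output is chosen once, from the first row. */
    public static Map<Integer, List<String>> writeWithSingleFile(List<Row> rows) {
        Map<Integer, List<String>> files = new TreeMap<>();
        boolean filesCreated = false;
        int bucketId = -1;
        for (Row row : rows) {
            if (!filesCreated) {
                // Only the first row ever reaches this branch.
                bucketId = row.bucketId;
                files.put(bucketId, new ArrayList<>());
                filesCreated = true;
            }
            // Later rows land in the first row's file, whatever their bucket Id is.
            files.get(bucketId).add(row.value);
        }
        return files;
    }

    /** Assumed alternative: find or create the output for each row's own bucket Id. */
    public static Map<Integer, List<String>> writePerBucket(List<Row> rows) {
        Map<Integer, List<String>> files = new TreeMap<>();
        for (Row row : rows) {
            files.computeIfAbsent(row.bucketId, id -> new ArrayList<>()).add(row.value);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(0, "a"), new Row(3, "b"), new Row(1, "c"));
        System.out.println(writeWithSingleFile(rows)); // {0=[a, b, c]}
        System.out.println(writePerBucket(rows));      // {0=[a], 1=[c], 3=[b]}
    }
}
```

With rows from buckets 0, 3 and 1, the first method produces a single output holding all three rows, which matches the single bucket_00000 file seen after compaction. The second method produces one output per bucket Id.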