hive-issues mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Work logged] (HIVE-24021) Read insert-only tables truncated by Impala correctly
Date Mon, 10 Aug 2020 14:43:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-24021?focusedWorklogId=468660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468660 ]

ASF GitHub Bot logged work on HIVE-24021:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Aug/20 14:42
            Start Date: 10/Aug/20 14:42
    Worklog Time Spent: 10m 
      Work Description: klcopp commented on a change in pull request #1384:
URL: https://github.com/apache/hive/pull/1384#discussion_r467952862



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnCommandsForMmTable.java
##########
@@ -481,6 +482,61 @@ public void testOperationsOnCompletedTxnComponentsForMmTable() throws Exception
     verifyDirAndResult(0, true);
   }
 
+  /**
+   * Impala truncates insert-only tables by writing a base directory (like insert overwrite) containing an empty file
+   * named "_empty". Generally in Hive files beginning with an underscore are hidden, so here we make sure that Hive
+   * reads these bases correctly.
+   *
+   * @throws Exception
+   */
+  @Test
+  public void testImpalaTruncatedMmTable() throws Exception {
+    FileSystem fs = FileSystem.get(hiveConf);
+    FileStatus[] status;
+
+    Path tblLocation = new Path(TEST_WAREHOUSE_DIR + "/" +
+        (TableExtended.MMTBL).toString().toLowerCase());
+
+    // 1. Insert two rows to an MM table
+    runStatementOnDriver("insert into " + TableExtended.MMTBL + "(a,b) values(1,2)");
+    runStatementOnDriver("insert into " + TableExtended.MMTBL + "(a,b) values(3,4)");
+    status = fs.listStatus(tblLocation, FileUtils.STAGING_DIR_PATH_FILTER);
+    // There should be 2 delta dirs in the location
+    Assert.assertEquals(2, status.length);
+    for (int i = 0; i < status.length; i++) {
+      Assert.assertTrue(status[i].getPath().getName().matches("delta_.*"));
+    }
+
+    // 2. Simulate Impala truncating the table: write a base dir (base_0000003) containing an empty file.
+    // Hive will name the empty file "000000_0"
+    runStatementOnDriver("insert overwrite  table " + TableExtended.MMTBL + " select * from "
+        + TableExtended.MMTBL + " where 1=2");
+    status = fs.listStatus(tblLocation, FileUtils.STAGING_DIR_PATH_FILTER);
+    // There should be 2 delta dirs, plus 1 base dir in the location
+    Assert.assertEquals(3, status.length);
+    int baseCount = 0;
+    int deltaCount = 0;
+    for (int i = 0; i < status.length; i++) {
+      String dirName = status[i].getPath().getName();
+      if (dirName.matches("delta_.*")) {
+        deltaCount++;
+      } else {
+        baseCount++;
+      }
+    }
+    Assert.assertEquals(2, deltaCount);
+    Assert.assertEquals(1, baseCount);
+
+    // rename empty file to "_empty"
+    Path basePath = new Path(tblLocation, "base_0000003");
+    Assert.assertTrue("Rename failed",
+        fs.rename(new Path(basePath, "000000_0"), new Path(basePath, "_empty")));
+
+    // 3. Verify query result. Selecting from a truncated table should return nothing.
+    List<String> rs = runStatementOnDriver("select a,b from " + TableExtended.MMTBL + " order by a,b");
+    Assert.assertEquals(Collections.emptyList(), rs);
+  }

Review comment:
       Great idea!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 468660)
    Time Spent: 1h 10m  (was: 1h)

> Read insert-only tables truncated by Impala correctly
> -----------------------------------------------------
>
>                 Key: HIVE-24021
>                 URL: https://issues.apache.org/jira/browse/HIVE-24021
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only tables by writing a base directory containing an empty file named "_empty" (as Hive should; see HIVE-20137). Generally in Hive, a file name beginning with an underscore denotes a temporary file that isn't supposed to be read by operations that didn't create it.
>  Before HIVE-23495, getAcidState listed each directory in the table (HdfsUtils#listLocatedStatus) and filtered out directories with names beginning with an underscore or period, as they are presumably temporary. This allowed files called "_empty" to be read, since Hive checked the directory name and not the file name.
>  After HIVE-23495, we recursively list each file in the table (AcidUtils#getHdfsDirSnapshots) with a filter that doesn't accept files with names beginning with an underscore or period, as they are presumably temporary. As a result, Hive reads the table data as if the truncate operation had not happened.
> Since performance in getAcidState is important, probably the best solution is to make an exception in the filter and accept files with the name "_empty".
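
The filtering behaviour described above can be illustrated with a minimal sketch. It works on plain name strings only and is not the actual Hive code: the class EmptyMarkerFilterSketch and all of its methods are hypothetical names made up for illustration, and the real filters in HdfsUtils and AcidUtils may be structured quite differently.

```java
// Hypothetical illustration only; none of these names exist in Hive.
final class EmptyMarkerFilterSketch {

  // Name of the empty file Impala writes into the new base directory on truncate.
  static final String EMPTY_MARKER = "_empty";

  private EmptyMarkerFilterSketch() {}

  // Names starting with an underscore or a period are presumed temporary.
  static boolean looksTemporary(String name) {
    return name.startsWith("_") || name.startsWith(".");
  }

  // Directory-level check, as described for the time before HIVE-23495:
  // only the directory name is inspected, the file name is ignored.
  static boolean acceptByDirName(String dirName, String fileName) {
    return !looksTemporary(dirName);
  }

  // File-level check, as described for the time after HIVE-23495:
  // a file named "_empty" is rejected, so the truncating base looks empty.
  static boolean acceptByFileName(String fileName) {
    return !looksTemporary(fileName);
  }

  // Proposed behaviour: same file-level check, with one exception for "_empty".
  static boolean acceptWithEmptyException(String fileName) {
    return EMPTY_MARKER.equals(fileName) || !looksTemporary(fileName);
  }
}
```

Under these assumptions, a file "_empty" inside "base_0000003" passes the directory-level check, fails the plain file-level check, and passes again once the exception is added, while other underscore- or period-prefixed names stay hidden.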



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
