hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Work logged] (HIVE-23597) VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete delta directories multiple times
Date Thu, 11 Jun 2020 12:59:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-23597?focusedWorklogId=444254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-444254
]

ASF GitHub Bot logged work on HIVE-23597:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Jun/20 12:58
            Start Date: 11/Jun/20 12:58
    Worklog Time Spent: 10m 
      Work Description: pvary commented on a change in pull request #1081:
URL: https://github.com/apache/hive/pull/1081#discussion_r438762099



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -1561,24 +1572,22 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir))
{
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
                 new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
-              // NOTE: Calling last flush length below is more for future-proofing when we
have
-              // streaming deletes. But currently we don't support streaming deletes, and
this can
-              // be removed if this becomes a performance issue.
-              long length = OrcAcidUtils.getLastFlushLength(fs, deleteDeltaFile);
+              // NOTE: When streaming deletes are supported, consider using OrcAcidUtils.getLastFlushLength(fs,
deleteDeltaFile)
               // NOTE: A check for existence of deleteDeltaFile is required because we may
not have
               // deletes for the bucket being taken into consideration for this split processing.
-              if (length != -1 && fs.exists(deleteDeltaFile)) {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from
there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile,
-                    OrcFile.readerOptions(conf).maxLength(length));
+              if (fs.exists(deleteDeltaFile)) {

Review comment:
       Looks good, let's get a green run :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 444254)
    Time Spent: 1h  (was: 50m)

> VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete delta directories
multiple times
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23597
>                 URL: https://issues.apache.org/jira/browse/HIVE-23597
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562]
> {code:java}
> try {
>         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
>         if (deleteDeltaDirs.length > 0) {
>           int totalDeleteEventCount = 0;
>           for (Path deleteDeltaDir : deleteDeltaDirs) {
> {code}
>  
> Consider a directory layout like the following. This was created by having simple set
of "insert --> update --> select" queries.
>  
> {noformat}
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000001
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000002
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000013_0000013_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000013_0000013_0000
{noformat}
>  
> Orcsplit contains all the delete delta folder information. For the directory layout like
this, it would create {{~12 splits}}. For every split, it constructs "ColumnizedDeleteEventRegistry"
in VRBAcidReader and ends up reading all these delete delta folders multiple times.
>  In this case, it would read it approximately {{121 times!}}.
> This causes huge delay in running simple queries like "{{select * from tab_x}}" in cloud
storage. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message