hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <>
Subject [jira] [Commented] (HIVE-17458) VectorizedOrcAcidRowBatchReader doesn't handle 'original' files
Date Thu, 02 Nov 2017 22:27:00 GMT


Sergey Shelukhin commented on HIVE-17458:

Left some comments. My two main questions are:
1) The patch mentions that non-split-update ACID cannot be read in Hive 3. Wouldn't that mean
that none of the legacy ACID data can be read? Reader compatibility should still be possible.
2) If there are only originals and no deltas, does it still activate the row ID machinery?
It looks like that should be unnecessary.

> VectorizedOrcAcidRowBatchReader doesn't handle 'original' files
> ---------------------------------------------------------------
>                 Key: HIVE-17458
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-17458.01.patch, HIVE-17458.02.patch, HIVE-17458.03.patch, HIVE-17458.04.patch,
HIVE-17458.05.patch, HIVE-17458.06.patch, HIVE-17458.07.patch, HIVE-17458.07.patch, HIVE-17458.08.patch,
HIVE-17458.09.patch, HIVE-17458.10.patch, HIVE-17458.11.patch, HIVE-17458.12.patch, HIVE-17458.12.patch,
HIVE-17458.13.patch, HIVE-17458.14.patch, HIVE-17458.15.patch
> VectorizedOrcAcidRowBatchReader will not be used for original files.  This will likely
look like a perf regression when converting a table from non-acid to acid until it runs through
a major compaction.
> With Load Data support, if large files are added via Load Data, the read ops will not
vectorize until major compaction.  
> There is no reason why this should be the case.  Just like OrcRawRecordMerger, VectorizedOrcAcidRowBatchReader
can look at the other files in the logical tranche/bucket and calculate the offset for the
RowBatch of the split.  (Presumably getRecordReader().getRowNumber() works the same in vector
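
For illustration, the offset calculation described above might look roughly like the sketch below. It is not code from the patch: OriginalFile, RowOffsets, firstRowId and rowIdInBatch are hypothetical names, the row counts are assumed to be already known (for example from file footers), and ordering the files of a bucket by name is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for one 'original' file of a bucket: its name and row count. */
final class OriginalFile {
    final String name;
    final long rowCount;

    OriginalFile(String name, long rowCount) {
        this.name = name;
        this.rowCount = rowCount;
    }
}

/** Hypothetical helper; none of these names come from Hive. */
final class RowOffsets {
    private RowOffsets() {}

    /**
     * Returns the synthetic row id of the first row of the target file:
     * the total row count of all files in the same bucket that sort before it.
     * Sorting by file name is an assumption of this sketch.
     */
    static long firstRowId(List<OriginalFile> bucketFiles, String targetName) {
        List<OriginalFile> sorted = new ArrayList<>(bucketFiles);
        sorted.sort(Comparator.comparing((OriginalFile f) -> f.name));
        long offset = 0;
        for (OriginalFile f : sorted) {
            if (f.name.equals(targetName)) {
                return offset;
            }
            offset += f.rowCount;
        }
        throw new IllegalArgumentException("File not in bucket: " + targetName);
    }

    /** Row id of a row inside a batch: file offset plus the row's position in the file. */
    static long rowIdInBatch(long fileOffset, long rowNumberInFile) {
        return fileOffset + rowNumberInFile;
    }
}
```

In a real reader, the second argument of rowIdInBatch would presumably come from something like getRecordReader().getRowNumber(), which is the point the description raises.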
> In this case we don't even need OrcSplit.isOriginal() - the reader can infer it from
file path... which in particular simplifies OrcInputFormat.determineSplitStrategies()
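
As a rough sketch of inferring "original" from the file path: ACID files live under directories named base_*, delta_* or delete_delta_*, so a file with no such directory among its ancestors can be treated as original. AcidPathSketch and isOriginal below are hypothetical names for this example only, not Hive APIs, and the paths in the usage note are made up.

```java
/** Hypothetical helper; the class and method names are not Hive APIs. */
final class AcidPathSketch {
    private AcidPathSketch() {}

    /**
     * Returns true if no ancestor directory of the file looks like an ACID
     * base, delta, or delete-delta directory. Only directory components are
     * checked; the file name itself (the last component) is ignored.
     */
    static boolean isOriginal(String filePath) {
        String[] parts = filePath.split("/");
        for (int i = 0; i < parts.length - 1; i++) {
            String dir = parts[i];
            if (dir.startsWith("base_")
                    || dir.startsWith("delta_")
                    || dir.startsWith("delete_delta_")) {
                return false;
            }
        }
        return true;
    }
}
```

With this, isOriginal("/warehouse/t/000000_0") is true, while isOriginal("/warehouse/t/delta_0000001_0000001/bucket_00000") is false.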

This message was sent by Atlassian JIRA
