hive-issues mailing list archives

From "Vihang Karajgaonkar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-17696) Vectorized reader does not seem to be pushing down projection columns in certain code paths
Date Thu, 26 Oct 2017 21:31:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221231#comment-16221231 ]

Vihang Karajgaonkar edited comment on HIVE-17696 at 10/26/17 9:30 PM:
----------------------------------------------------------------------

Hi [~Ferd], I took a look at version 2 of the patch. It seems the original issue still remains
unresolved. The patch refactors to remove code duplication, which is good, but in the method
{{getRequestedSchema}}, shouldn't line 396 of DataWritableReadSupport:

{noformat}
396	      return fileSchema;
{noformat}

return tableSchema instead?

Am I missing something here?


was (Author: vihangk1):
Hi [~Ferd] I took a look at the patch version 2. It seems like the original issue still remains
unresolved. The patch refactors to remove code duplication which is good by in the method
{{getRequestedSchema}} shouldn't the following line:

396	      return fileSchema;
be 
396	      return tableSchema;

Am I missing something here?

> Vectorized reader does not seem to be pushing down projection columns in certain code paths
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17696
>                 URL: https://issues.apache.org/jira/browse/HIVE-17696
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Vihang Karajgaonkar
>            Assignee: Ferdinand Xu
>             Fix For: 3.0.0
>
>         Attachments: HIVE-17696.2.patch, HIVE-17696.patch
>
>
> This is the code snippet from {{VectorizedParquetRecordReader.java}}
> {noformat}
> MessageType tableSchema;
>     if (indexAccess) {
>       List<Integer> indexSequence = new ArrayList<>();
>       // Generates a sequence list of indexes
>       for(int i = 0; i < columnNamesList.size(); i++) {
>         indexSequence.add(i);
>       }
>       tableSchema = DataWritableReadSupport.getSchemaByIndex(fileSchema, columnNamesList,
>         indexSequence);
>     } else {
>       tableSchema = DataWritableReadSupport.getSchemaByName(fileSchema, columnNamesList,
>         columnTypesList);
>     }
>     indexColumnsWanted = ColumnProjectionUtils.getReadColumnIDs(configuration);
>     if (!ColumnProjectionUtils.isReadAllColumns(configuration) && !indexColumnsWanted.isEmpty()) {
>       requestedSchema =
>         DataWritableReadSupport.getSchemaByIndex(tableSchema, columnNamesList, indexColumnsWanted);
>     } else {
>       requestedSchema = fileSchema;
>     }
>     this.reader = new ParquetFileReader(
>       configuration, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
> {noformat}
> A couple of things to notice here:
> Most of this code is duplicated from the {{DataWritableReadSupport.init()}} method.
> The else branch assigns fileSchema instead of tableSchema, unlike the {{DataWritableReadSupport.init()}} method. Does this cause projection columns to be missed when we read Parquet files? We should probably just reuse the ReadContext returned from the {{DataWritableReadSupport.init()}} method here.
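
To make the question about the else branch concrete, here is a minimal sketch that mirrors the branch structure of the snippet. It is a toy model, not Hive or Parquet code: a schema is reduced to an ordered list of column names, and {{ProjectionSketch}}, {{tableSchemaByName}}, {{project}}, the {{fallBackToTableSchema}} flag and the column names are all hypothetical names introduced for illustration. It assumes a file that carries a column the table does not declare, and only shows how the two fallback choices would differ in that case; it does not establish what the real reader does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model: a "schema" is just an ordered list of column names.
// None of these helpers are Hive or Parquet APIs; they are hypothetical stand-ins.
final class ProjectionSketch {

  // Keep only the table columns that also exist in the file, in table order
  // (a simplified stand-in for a by-name schema lookup).
  static List<String> tableSchemaByName(List<String> fileSchema, List<String> tableColumns) {
    List<String> result = new ArrayList<>();
    for (String column : tableColumns) {
      if (fileSchema.contains(column)) {
        result.add(column);
      }
    }
    return result;
  }

  // Pick the columns found at the wanted indexes of the table schema.
  static List<String> project(List<String> tableSchema, List<Integer> wanted) {
    List<String> result = new ArrayList<>();
    for (int index : wanted) {
      if (index >= 0 && index < tableSchema.size()) {
        result.add(tableSchema.get(index));
      }
    }
    return result;
  }

  // Same if/else shape as the quoted snippet. fallBackToTableSchema selects
  // which schema the else branch returns, so both variants can be compared.
  static List<String> requestedSchema(List<String> fileSchema, List<String> tableColumns,
      boolean readAllColumns, List<Integer> wanted, boolean fallBackToTableSchema) {
    List<String> tableSchema = tableSchemaByName(fileSchema, tableColumns);
    if (!readAllColumns && !wanted.isEmpty()) {
      return project(tableSchema, wanted);
    }
    return fallBackToTableSchema ? tableSchema : fileSchema;
  }

  public static void main(String[] args) {
    // Assumed scenario: the file has one column the table does not declare.
    List<String> fileSchema = Arrays.asList("id", "name", "legacy_blob");
    List<String> tableColumns = Arrays.asList("id", "name");
    List<Integer> none = new ArrayList<>();

    // else branch returning fileSchema: prints [id, name, legacy_blob]
    System.out.println(requestedSchema(fileSchema, tableColumns, true, none, false));
    // else branch returning tableSchema: prints [id, name]
    System.out.println(requestedSchema(fileSchema, tableColumns, true, none, true));
  }
}
```

In this toy model the two variants differ only when the file contains columns outside the table definition; when an explicit list of wanted indexes is supplied, both take the first branch and return the same projection.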



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
