hive-issues mailing list archives

From "Marta Kuczora (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-21407) Parquet predicate pushdown is not working correctly for char column types
Date Sun, 14 Apr 2019 13:01:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817303#comment-16817303
] 

Marta Kuczora commented on HIVE-21407:
--------------------------------------

The TestParquetRecordReaderWrapper and TestParquetFilterPredicate classes both test the
same thing: the behavior of the ParquetFilterPredicateConverter.toFilterPredicate method.
It doesn't make sense to have tests for the same use case in different test classes, so I moved
the test cases from TestParquetRecordReaderWrapper to TestParquetFilterPredicate.

> Parquet predicate pushdown is not working correctly for char column types
> -------------------------------------------------------------------------
>
>                 Key: HIVE-21407
>                 URL: https://issues.apache.org/jira/browse/HIVE-21407
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-21407.patch
>
>
> If the 'hive.optimize.index.filter' parameter is false, the filter predicate is not pushed
to Parquet, so the filtering happens only within Hive. If the parameter is true, the filter
is pushed to Parquet, but for a char type, the value pushed to Parquet is padded
with spaces:
> {noformat}
>   @Override
>   public void setValue(String val, int len) {
>     super.setValue(HiveBaseChar.getPaddedValue(val, len), -1);
>   }
> {noformat} 
> So if we have a char(10) column which contains the value "apple" and the where condition
is 'where c='apple'', the value pushed to Parquet will be 'apple' followed by 5 spaces.
But the stored values are not padded, so no rows will be returned from Parquet.
> How to reproduce:
> {noformat}
> $ create table ppd (c char(10), v varchar(10), i int) stored as parquet;
> $ insert into ppd values ('apple', 'bee', 1),('apple', 'tree', 2),('hello', 'world',
1),('hello','vilag',3);
> $ set hive.optimize.ppd.storage=true;
> $ set hive.vectorized.execution.enabled=true;
> $ set hive.vectorized.execution.enabled=false;
> $ set hive.optimize.ppd=true;
> $ set hive.optimize.index.filter=true;
> $ set hive.parquet.timestamp.skip.conversion=false;
> $ select * from ppd where c='apple';
> +--------+--------+--------+
> | ppd.c  | ppd.v  | ppd.i  |
> +--------+--------+--------+
> +--------+--------+--------+
> $ set hive.optimize.index.filter=false; or set hive.optimize.ppd.storage=false;
> $ select * from ppd where c='apple';
> +-------------+--------+--------+
> |    ppd.c    | ppd.v  | ppd.i  |
> +-------------+--------+--------+
> | apple       | bee    | 1      |
> | apple       | tree   | 2      |
> +-------------+--------+--------+
> {noformat}
> The issue surfaced after the fix for [HIVE-21327|https://issues.apache.org/jira/browse/HIVE-21327]
was uploaded upstream. Before the HIVE-21327 fix, setting the parameter 'hive.parquet.timestamp.skip.conversion'
to true in the parquet_ppd_char.q test hid this issue.
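
The padding mismatch described in the issue can be illustrated with a minimal, Hive-free
sketch. Everything here is an assumption made for illustration: the class name
CharPaddingSketch, the pad() helper (a simplified stand-in for what HiveBaseChar.getPaddedValue
does), and filterEq() (a stand-in for an equality predicate evaluated against stored values).
Stripping the trailing spaces before the comparison is shown as one possible way to make the
literal match; it is not necessarily what the attached patch does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CharPaddingSketch {

    // Hypothetical stand-in for the padding step: right-pad with spaces to len,
    // truncating if the value is longer than len.
    public static String pad(String val, int len) {
        if (val.length() >= len) {
            return val.substring(0, len);
        }
        StringBuilder sb = new StringBuilder(val);
        while (sb.length() < len) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Removes trailing spaces only; leading and inner spaces are kept.
    public static String stripTrailingSpaces(String val) {
        int end = val.length();
        while (end > 0 && val.charAt(end - 1) == ' ') {
            end--;
        }
        return val.substring(0, end);
    }

    // Hypothetical stand-in for an equality predicate applied to stored values.
    public static List<String> filterEq(List<String> stored, String literal) {
        return stored.stream().filter(literal::equals).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Values of column c as stored in the file, without padding.
        List<String> stored = Arrays.asList("apple", "apple", "hello", "hello");

        // The char(10) literal after padding: "apple" followed by 5 spaces.
        String pushed = pad("apple", 10);

        // The padded literal matches nothing.
        System.out.println(filterEq(stored, pushed).size());                      // prints 0

        // With the trailing spaces stripped, both "apple" rows match.
        System.out.println(filterEq(stored, stripTrailingSpaces(pushed)).size()); // prints 2
    }
}
```

The first count corresponds to the empty result with hive.optimize.index.filter=true in the
reproduction steps, and the second to the two rows returned when the filter is evaluated only
within Hive.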



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
