hive-issues mailing list archives

From "Marta Kuczora (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-21407) Parquet predicate pushdown is not working correctly for char column types
Date Sun, 14 Apr 2019 12:30:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817290#comment-16817290 ]

Marta Kuczora commented on HIVE-21407:
--------------------------------------

The idea behind the patch is that, for CHAR columns, the predicate which is pushed to Parquet
is extended with an “or” clause, so that the same expression is evaluated with both the padded
and the stripped value.
Example:
column c is a CHAR(10) type and the search expression is c='apple'
The predicate which is pushed to Parquet looked like c='apple     ' before the patch and it
would look like (c='apple     ' or c='apple') after the patch.
Since the value 'apple' is stored in Parquet without padding, the predicate before the patch
didn’t return any rows. With the patch it will return the correct row. 
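
The rewrite in this example can be sketched roughly as below. This is only an illustration
on plain strings, not the code from the patch: the class CharPredicateSketch and the helpers
pad, stripTrailingSpaces and expandEquals are hypothetical names, and the real change works
on Hive's predicate objects rather than on text.

```java
// Illustrative sketch only; all names here are hypothetical, not Hive APIs.
class CharPredicateSketch {

    // Right-pads a value with spaces up to the declared CHAR length.
    static String pad(String value, int length) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < length) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Removes trailing spaces only; leading and inner spaces are kept.
    static String stripTrailingSpaces(String value) {
        int end = value.length();
        while (end > 0 && value.charAt(end - 1) == ' ') {
            end--;
        }
        return value.substring(0, end);
    }

    // Builds the text of the widened equality predicate:
    // the literal as pushed, OR-ed with its stripped form when the two differ.
    static String expandEquals(String column, String pushedLiteral) {
        String stripped = stripTrailingSpaces(pushedLiteral);
        String original = column + "='" + pushedLiteral + "'";
        if (stripped.equals(pushedLiteral)) {
            return original; // nothing to strip, so no extra clause is needed
        }
        return "(" + original + " or " + column + "='" + stripped + "')";
    }

    public static void main(String[] args) {
        // CHAR(10) column c, search expression c='apple'
        System.out.println(expandEquals("c", pad("apple", 10)));
        // prints: (c='apple     ' or c='apple')
    }
}
```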
Since there is no distinction between CHAR and VARCHAR at the predicate level, the predicates
for VARCHAR columns will be changed as well, so the result set returned from Parquet will be
wider than before.
Example:
A table contains a column c of type VARCHAR(10), with one row where c='apple' and another
row where c='apple  '. If the search expression is c='apple  ', both rows will be returned
from Parquet after the patch. This is not a problem, because Hive does an additional filtering
on the rows returned from Parquet, so the result set returned by Hive will contain only the
row with the value 'apple  '.
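
To illustrate why the wider result from Parquet is harmless, here is a small simulation of
the two filtering stages on an in-memory list. It is a sketch under assumptions: the class
TwoStageFilterSketch and the methods storageFilter and engineFilter are made-up names standing
in for the Parquet-side predicate and the Hive-side filter, and the row values are taken from
the VARCHAR example above plus one unrelated value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not Hive or Parquet APIs.
class TwoStageFilterSketch {

    // Removes trailing spaces only.
    static String stripTrailingSpaces(String value) {
        int end = value.length();
        while (end > 0 && value.charAt(end - 1) == ' ') {
            end--;
        }
        return value.substring(0, end);
    }

    // Stage 1: the widened predicate, matching the literal or its stripped form.
    static List<String> storageFilter(List<String> stored, String literal) {
        String stripped = stripTrailingSpaces(literal);
        List<String> result = new ArrayList<>();
        for (String value : stored) {
            if (value.equals(literal) || value.equals(stripped)) {
                result.add(value);
            }
        }
        return result;
    }

    // Stage 2: the exact comparison applied again on the rows that came back.
    static List<String> engineFilter(List<String> rows, String literal) {
        List<String> result = new ArrayList<>();
        for (String value : rows) {
            if (value.equals(literal)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("apple", "apple  ", "pear");
        String literal = "apple  ";

        List<String> fromStorage = storageFilter(stored, literal);
        List<String> finalRows = engineFilter(fromStorage, literal);

        System.out.println(fromStorage.size()); // 2: both 'apple' and 'apple  ' pass stage 1
        System.out.println(finalRows.size());   // 1: only 'apple  ' survives stage 2
    }
}
```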


> Parquet predicate pushdown is not working correctly for char column types
> -------------------------------------------------------------------------
>
>                 Key: HIVE-21407
>                 URL: https://issues.apache.org/jira/browse/HIVE-21407
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-21407.patch
>
>
> If the 'hive.optimize.index.filter' parameter is false, the filter predicate is not pushed
> to Parquet, so the filtering only happens within Hive. If the parameter is true, the filter
> is pushed to Parquet, but for a char type, the value which is pushed to Parquet is padded
> with spaces:
> {noformat}
>   @Override
>   public void setValue(String val, int len) {
>     super.setValue(HiveBaseChar.getPaddedValue(val, len), -1);
>   }
> {noformat} 
> So if we have a char(10) column which contains the value "apple" and the where condition
> is "where c='apple'", the value pushed to Parquet will be 'apple' followed by 5 spaces.
> But the stored values are not padded, so no rows will be returned from Parquet.
> How to reproduce:
> {noformat}
> $ create table ppd (c char(10), v varchar(10), i int) stored as parquet;
> $ insert into ppd values ('apple', 'bee', 1),('apple', 'tree', 2),('hello', 'world', 1),('hello','vilag',3);
> $ set hive.optimize.ppd.storage=true;
> $ set hive.vectorized.execution.enabled=true;
> $ set hive.vectorized.execution.enabled=false;
> $ set hive.optimize.ppd=true;
> $ set hive.optimize.index.filter=true;
> $ set hive.parquet.timestamp.skip.conversion=false;
> $ select * from ppd where c='apple';
> +--------+--------+--------+
> | ppd.c  | ppd.v  | ppd.i  |
> +--------+--------+--------+
> +--------+--------+--------+
> $ set hive.optimize.index.filter=false; or set hive.optimize.ppd.storage=false;
> $ select * from ppd where c='apple';
> +-------------+--------+--------+
> |    ppd.c    | ppd.v  | ppd.i  |
> +-------------+--------+--------+
> | apple       | bee    | 1      |
> | apple       | tree   | 2      |
> +-------------+--------+--------+
> {noformat}
> The issue surfaced after the fix for [HIVE-21327|https://issues.apache.org/jira/browse/HIVE-21327]
> was uploaded upstream. Before the HIVE-21327 fix, setting the parameter 'hive.parquet.timestamp.skip.conversion'
> to true in the parquet_ppd_char.q test hid this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
