hive-issues mailing list archives

From "Dong Chen (JIRA)" <>
Subject [jira] [Commented] (HIVE-10252) Make PPD work for Parquet in row group level
Date Fri, 10 Apr 2015 03:43:13 GMT


Dong Chen commented on HIVE-10252:

Thanks for your review, Szehon. In my understanding it does something else, and we need to keep it.

The filter predicate is used in 2 phases when reading parquet files:
1. Row group level (coarse-grained)
A row group consists of multiple rows, and a Parquet file may consist of multiple row groups.
When Hive begins to read a Parquet file, it first scans each row group and uses the filter to
eliminate the non-matching ones.

2. Row level (fine-grained)
For the row groups remaining after phase 1, Parquet reads each row in them and uses the filter
to eliminate non-matching rows, returning only the matching rows to Hive in order to save time.
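The two phases above can be sketched in plain Java. This is a hypothetical illustration, not the real Parquet reader code: the `RowGroup` class and the equality predicate are stand-ins for Parquet's row groups with min/max column statistics and for a Hive filter predicate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two filtering phases, using plain Java
// stand-ins in place of the real Parquet reader classes.
public class TwoPhaseFilterSketch {

    // Stand-in for a Parquet row group: its rows plus min/max statistics.
    static class RowGroup {
        final long[] values;
        final long min;
        final long max;
        RowGroup(long... values) {
            this.values = values;
            long mn = Long.MAX_VALUE, mx = Long.MIN_VALUE;
            for (long v : values) { mn = Math.min(mn, v); mx = Math.max(mx, v); }
            this.min = mn;
            this.max = mx;
        }
    }

    // Phase 1 (coarse-grained): drop whole row groups whose [min, max]
    // range cannot contain a match for the predicate "value == target".
    static List<RowGroup> pruneRowGroups(List<RowGroup> groups, long target) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup g : groups) {
            if (target >= g.min && target <= g.max) {
                kept.add(g);
            }
        }
        return kept;
    }

    // Phase 2 (fine-grained): scan the rows of the surviving groups and
    // return only the matching ones.
    static List<Long> filterRows(List<RowGroup> groups, long target) {
        List<Long> matches = new ArrayList<>();
        for (RowGroup g : groups) {
            for (long v : g.values) {
                if (v == target) matches.add(v);
            }
        }
        return matches;
    }
}
```

Phase 1 is cheap because it only consults per-group statistics; phase 2 still has to touch every row of the groups that survive, which is why skipping phase 1 (the bug this Jira addresses) costs real I/O.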

Then I found that the 1st phase does not currently work in Hive, and this Jira is for fixing it.
The 2nd phase happens inside Parquet, and {{ParquetInputFormat.setFilterPredicate()}} is used to
pass the filter from Hive to Parquet. So we may need to keep it there.

Thanks for figuring out the confusion, and please let me know if you have any other questions about it.

> Make PPD work for Parquet in row group level
> --------------------------------------------
>                 Key: HIVE-10252
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Dong Chen
>            Assignee: Dong Chen
>         Attachments: HIVE-10252.patch
> In Hive, predicate pushdown figures out the search condition in the HQL, serializes it, and
pushes it to the file format. ORC can use the predicate to filter stripes. Similarly, Parquet should
use the statistics saved in each row group to filter out non-matching row groups. But it does not work.
> In {{ParquetRecordReaderWrapper}}, it gets splits with all row groups (client side), and
pushes the filter to Parquet for further processing (Parquet side). But in {{ParquetRecordReader.initializeInternalReader()}},
if the splits have already been selected on the client side, it will not apply the filter again.
> We should make the behavior consistent in Hive. Maybe we could get the splits, filter them,
and then pass them to Parquet. This means using the client-side strategy.
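The client-side strategy described above can be sketched as follows. This is a hypothetical illustration under assumed names: `Split` stands in for an input split covering one row group (carrying its min/max statistics), and `selectSplits` stands in for the client-side filtering step that would run before handing splits to the Parquet reader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed client-side strategy: enumerate
// splits, drop those whose row-group statistics rule out any match,
// and hand only the survivors to the Parquet reader.
public class ClientSideSplitFilter {

    // Stand-in for an input split covering one row group, with its
    // min/max column statistics.
    static class Split {
        final String path;
        final long min, max;
        Split(String path, long min, long max) {
            this.path = path;
            this.min = min;
            this.max = max;
        }
    }

    // Client side: keep only splits that might satisfy "value >= threshold".
    // A split whose max is below the threshold cannot contain a match.
    static List<Split> selectSplits(List<Split> allSplits, long threshold) {
        List<Split> selected = new ArrayList<>();
        for (Split s : allSplits) {
            if (s.max >= threshold) {
                selected.add(s);
            }
        }
        return selected;
    }
}
```

Because the selection happens before the reader is initialized, the reader never opens the pruned row groups at all, which is the consistency the Jira is after: the same predicate prunes at both the split/row-group level and the row level.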

This message was sent by Atlassian JIRA
