hive-issues mailing list archives

From Jakub Havlík (JIRA) <j...@apache.org>
Subject [jira] [Updated] (HIVE-10278) Hive does not use Parquet projection to access structures
Date Wed, 15 Apr 2015 08:03:58 GMT

     [ https://issues.apache.org/jira/browse/HIVE-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakub Havlík updated HIVE-10278:
--------------------------------
    Priority: Blocker  (was: Major)

> Hive does not use Parquet projection to access structures
> ---------------------------------------------------------
>
>                 Key: HIVE-10278
>                 URL: https://issues.apache.org/jira/browse/HIVE-10278
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, Hive, Physical Optimizer, Query Planning, Query Processor, Types
>    Affects Versions: 1.0.0
>         Environment: CentOS 6.5, Cloudera 2.5.0-cdh5.3.0, 120 nodes in a cluster.
>            Reporter: Jakub Havlík
>            Priority: Blocker
>              Labels: performance
>
> Selecting from a table stored in Parquet format that contains structures does not use
> projections as described in the Parquet specification. As a result, reading a single field
> from a structure reads the whole structure. This was found with the following test.
> Two tables, one flat and one with structures, were created as follows:
> drop table if exists test_flat;
> create table test_flat
>   (urlurl string,
>    urlvalid boolean,
>    urlhost string,
>    urldomain string,
>    urlsubdomain string,
>    urlprotocol string,
>    urlsuffix string,
>    urlmiddomain string,   
>    refererurl string,
>    referervalid boolean,
>    refererhost string,
>    refererdomain string,
>    referersubdomain string,
>    refererprotocol string,
>    referersuffix string,
>    referermiddomain string)
> stored as parquet
> ; 
> drop table if exists test_struct;
> create table test_struct
>   (url struct<url:string, valid:boolean, host:string, domain:string, subdomain:string, protocol:string, suffix:string, middomain:string>,
>    referer struct<url:string, valid:boolean, host:string, domain:string, subdomain:string, protocol:string, suffix:string, middomain:string>)
> stored as parquet; 
> The sizes of these tables are:
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_flat/
> 820.4 G  1.6 T  /results/havlik/new_calibration/test_flat
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_struct/
> 822.6 G  1.6 T  /results/havlik/new_calibration/test_struct
> Flat SELECT:
> select 
>     count(*)
> from 
>     test_flat
> where
>     urlvalid = true
>     and referervalid = true;
> Struct SELECT:
> select 
>     count(*)
> from 
>     test_struct
> where
>     url.valid = true
>     and referer.valid = true;
> CPU time:
> flat: 11785 seconds
> struct: 38004 seconds
> HDFS bytes read:
> flat: 1 812 148 468
> struct: 883 774 856 844 (which is the total size of the table)
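To put the two measurements above side by side, the following minimal sketch computes how much larger the struct figures are than the flat ones. The numbers are copied from the report; the variable and function names are illustrative only and do not come from Hive or Parquet.

```python
# Figures copied from the report above; names are illustrative.
cpu_seconds = {"flat": 11785, "struct": 38004}
hdfs_bytes_read = {"flat": 1_812_148_468, "struct": 883_774_856_844}


def ratio(measure):
    """Return how many times larger the struct figure is than the flat one."""
    return measure["struct"] / measure["flat"]


cpu_ratio = ratio(cpu_seconds)
bytes_ratio = ratio(hdfs_bytes_read)

print(f"CPU time: struct is {cpu_ratio:.1f}x flat")
print(f"HDFS bytes read: struct is {bytes_ratio:.0f}x flat")
```

On these figures the struct query uses about 3.2 times the CPU time and reads roughly 488 times as many bytes from HDFS as the flat query.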
> Using a custom MapReduce job, it is possible to use projections into structures and get
> results similar to the flat table. Hive needs to implement this, because the current behavior
> causes unnecessary disk reads and CPU overhead and cripples performance.
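As a rough illustration of what a projection into structures means for the struct query in this report, the sketch below builds a Parquet message-type string that keeps only `url.valid` and `referer.valid`. The helper `projected_schema`, the message name `hive_schema`, and the `optional` repetition on every field are assumptions made for the example; the report does not show the schema Hive actually wrote or how the reporter's own MapReduce job requested its projection.

```python
def projected_schema(message_name, wanted):
    """Build a Parquet message-type string containing only the wanted leaves.

    `wanted` maps a struct column name to a list of
    (parquet_primitive_type, field_name) pairs to keep from that struct.
    All fields are emitted as `optional`, which is an assumption.
    """
    lines = [f"message {message_name} {{"]
    for group, fields in wanted.items():
        lines.append(f"  optional group {group} {{")
        for parquet_type, field in fields:
            lines.append(f"    optional {parquet_type} {field};")
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


# Keep only the two boolean fields the struct query filters on.
schema = projected_schema(
    "hive_schema",  # hypothetical message name
    {
        "url": [("boolean", "valid")],
        "referer": [("boolean", "valid")],
    },
)
print(schema)
```

A reader given such a reduced schema would only need the two `valid` column chunks rather than all sixteen leaf columns, which is the behavior the report asks Hive to adopt.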



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
