spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: orc vs parquet aggregation, orc is really slow
Date Sat, 16 Apr 2016 08:02:08 GMT

A general recommendation, separate from the issue itself: do not store dates as strings. I
recommend making them ints here. That will be much faster with both formats.
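
A minimal sketch of the int encoding, assuming a yyyyMMdd layout so that numeric order matches
calendar order and the range filter from the query keeps working. The helper name `date_to_int`
is hypothetical and not something from this thread:

```python
from datetime import datetime


def date_to_int(date_str: str) -> int:
    """Convert 'yyyy-MM-dd' to an int of the form yyyyMMdd (hypothetical helper)."""
    d = datetime.strptime(date_str, "%Y-%m-%d")  # also rejects malformed dates
    return d.year * 10000 + d.month * 100 + d.day


# The bounds from the original query, expressed as ints.
start = date_to_int("2016-01-06")
end = date_to_int("2016-04-02")
where_clause = f"event_date >= {start} AND event_date <= {end}"
```

The same conversion would have to be applied when loading the partition column, so that the
stored values and the filter bounds use the same encoding.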

It could be that you loaded the data differently into the two tables. For tables like these,
you should generally insert the data sorted in both cases.
It could also be that the files are compressed in one case and not in the other. It is always
good practice to spell out all options in the CREATE TABLE statement, even the default ones.
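
As a rough sketch of what spelling out the options could look like, the snippet below builds one
CREATE TABLE statement per format with the compression codec stated explicitly. The function name
is hypothetical, the codec values are illustrative assumptions, and the property names
(`orc.compress`, `parquet.compression`) should be checked against the Hive version in use:

```python
# Compression table property per storage format (codec values are illustrative).
COMPRESSION_PROPS = {
    "ORC": ("orc.compress", "ZLIB"),
    "PARQUET": ("parquet.compression", "SNAPPY"),
}


def create_table_ddl(table: str, fmt: str) -> str:
    """Build a CREATE TABLE statement with explicit storage options (hypothetical helper)."""
    key, codec = COMPRESSION_PROPS[fmt]
    return (
        f"CREATE TABLE {table} (bookings DOUBLE, dealviews INT)\n"
        f"PARTITIONED BY (event_date INT)\n"  # INT rather than STRING, per the advice on dates
        f"STORED AS {fmt}\n"
        f"TBLPROPERTIES ('{key}'='{codec}')"
    )


orc_ddl = create_table_ddl("myTable_orc", "ORC")
parquet_ddl = create_table_ddl("myTable_parquet", "PARQUET")
```

Creating both tables from the same template makes it easier to see that they differ only in the
storage format and its codec, which keeps the comparison fair.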

Your Hive version seems a little outdated. Are you using Spark as the execution engine for Hive?
If so, you should upgrade to a newer version of Hive. The Spark execution engine on Hive is still
a bit more experimental than Tez. It also depends on which distribution you are using.

Normally I would expect both of them to perform similarly.

> On 16 Apr 2016, at 09:20, Maurin Lenglart <maurin@cuberonlabs.com> wrote:
> 
> Hi,
> I am executing one query:
> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
> FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
> GROUP BY `event_date` LIMIT 20000"
> 
> My table was created with something like:
>
> CREATE TABLE myTable (
>   bookings DOUBLE,
>   dealviews INT
> )
> PARTITIONED BY (event_date STRING)
> STORED AS ORC  -- or PARQUET
> 
> Parquet takes 9 seconds of cumulative CPU.
> ORC takes 50 seconds of cumulative CPU.
> 
> For ORC I have tried hiveContext.setConf("spark.sql.orc.filterPushdown", "true"),
> but it didn't change anything.
> 
> Am I missing something, or is Parquet better for this type of query?
>
> I am using Spark 1.6.0 with Hive 1.1.0.
> 
> thanks
> 
> 
