spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: parquet vs orc files
Date Wed, 21 Feb 2018 21:40:06 GMT
In the latest version both are equally well supported.

You need to insert the data sorted on the filtering columns.
Then you will benefit from min/max indexes and, in the case of ORC, additionally from bloom filters,
if you configure them.
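
To illustrate why the sort order matters for min/max indexes, here is a toy model in plain Python. It is not Spark and not the real Parquet/ORC reader; the group size, the value range, and the helper names `row_group_stats` and `groups_to_read` are made up for the illustration. It records a (min, max) pair per block of rows and counts how many blocks a range filter would still have to read.

```python
import random


def row_group_stats(values, group_size):
    """Split values into fixed-size groups and record (min, max) per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, lo, hi):
    """Count groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for gmin, gmax in stats if gmax >= lo and gmin <= hi)


random.seed(0)
values = list(range(100_000))
random.shuffle(values)  # data inserted in arbitrary order

unsorted_stats = row_group_stats(values, 10_000)
sorted_stats = row_group_stats(sorted(values), 10_000)

# Same filter against both layouts: 42000 <= x <= 43000
unsorted_reads = groups_to_read(unsorted_stats, 42_000, 43_000)
sorted_reads = groups_to_read(sorted_stats, 42_000, 43_000)
print("groups read, unsorted:", unsorted_reads)
print("groups read, sorted:  ", sorted_reads)
```

With shuffled input nearly every group spans the whole value range, so the statistics cannot rule any group out. With sorted input the filter range falls into a single group.
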
In any case, I also recommend partitioning the files (not to be confused with Spark
partitioning).
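
A minimal PySpark sketch of the advice above, under some assumptions: `df` is an existing Spark DataFrame, the function name `write_sorted_orc` and its parameters are hypothetical, and the `orc.bloom.filter.columns` writer option is only honoured by Spark versions that pass ORC options through to the ORC writer. Treat it as a starting point for your own test, not as a recommended configuration.

```python
def write_sorted_orc(df, path, partition_col, filter_col):
    """Write df as ORC, partitioned on disk and sorted on the filter column.

    df            -- a pyspark.sql.DataFrame (assumed to exist already)
    path          -- output directory (hypothetical)
    partition_col -- column used for the directory layout on disk
    filter_col    -- column you expect to filter on when reading
    """
    (
        df
        # Bring rows with the same partition value into the same task,
        # so each output directory receives few, larger files.
        .repartition(partition_col)
        # Sort inside each task so min/max statistics in the files are tight.
        .sortWithinPartitions(partition_col, filter_col)
        .write
        # File partitioning: one directory per value of partition_col.
        .partitionBy(partition_col)
        .format("orc")
        # Assumption: this ORC option is supported by your Spark version.
        .option("orc.bloom.filter.columns", filter_col)
        .save(path)
    )
```

For example, `write_sorted_orc(df, "/tmp/events_orc", "event_date", "user_id")` with made-up path and column names. When reading the data back, predicate pushdown has to be active for the indexes to help; depending on the Spark version, `spark.sql.orc.filterPushdown` may need to be enabled explicitly.
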

Which one is best for you, you will have to figure out in a test. This depends heavily on the
data and the analysis you want to do.

> On 21. Feb 2018, at 21:54, Kane Kim <kane.isturm@gmail.com> wrote:
> 
> Hello,
> 
> Which format is better supported in spark, parquet or orc?
> Will spark use internal sorting of parquet/orc files (and how to test that)?
> Can spark save sorted parquet/orc files? 
> 
> Thanks!
