spark-user mailing list archives

From Koert Kuipers <>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Tue, 26 Jul 2016 13:19:53 GMT
when parquet came out it was developed by a community of companies, and was
designed as a library to be supported by multiple big data projects. nice

orc, on the other hand, initially only supported hive. it wasn't even
designed as a library that could be re-used. even today it brings in the
kitchen sink of transitive dependencies. yikes

On Jul 26, 2016 5:09 AM, "Jörn Franke" <> wrote:

> I think both are very similar, but with slightly different goals. While
> they work transparently with any Hadoop application, you need to enable
> specific support in the application for predicate push down.
> In the end you have to check which application you are using and do some
> tests (with the correct predicate push down configuration). Keep in mind that
> both formats work best if the data is sorted on the filter columns (which is
> your responsibility) and if their optimizations are correctly configured
> (min/max index, bloom filter, compression, etc.).
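For Spark in particular, predicate push down is toggled per format. A minimal sketch of the relevant `spark-defaults.conf` entries (these setting names exist in Spark 2.0; check the defaults for your version, since the ORC one has historically been off by default):

```
spark.sql.parquet.filterPushdown   true
spark.sql.orc.filterPushdown       true
```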
> If you need to ingest sensor data you may want to store it first in HBase
> and then batch-process it into large files in ORC or Parquet format.
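The point above about sorting on filter columns and min/max indexes can be sketched in plain Python (hypothetical data and structures, not the actual ORC/Parquet reader code): both formats keep min/max statistics per row group or stripe, so a reader can skip any group whose range cannot contain the predicate value, and sorted data makes those ranges narrow and non-overlapping.

```python
def build_row_groups(values, group_size):
    """Split values into row groups, recording min/max stats per group."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equal(groups, target):
    """Evaluate `col == target`, reading only groups that could match."""
    groups_read = 0
    hits = []
    for g in groups:
        if g["min"] <= target <= g["max"]:  # min/max index check
            groups_read += 1
            hits.extend(v for v in g["rows"] if v == target)
    return hits, groups_read

sorted_data = list(range(100))                        # sorted filter column
shuffled_data = [(i * 37) % 100 for i in range(100)]  # same values, unsorted

_, read_sorted = scan_equal(build_row_groups(sorted_data, 10), 42)
_, read_shuffled = scan_equal(build_row_groups(shuffled_data, 10), 42)

print(read_sorted)    # 1  -- only one group's range can contain 42
print(read_shuffled)  # 10 -- overlapping ranges defeat group skipping
```

The same mechanism is why the advice to sort on your filter columns matters: the statistics are always written, but they only prune reads when the value ranges of the groups do not all overlap.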
> On 26 Jul 2016, at 04:09, janardhan shetty <> wrote:
> Just wondering about the advantages and disadvantages of converting data
> into ORC or Parquet.
> In the documentation of Spark there are numerous examples of the Parquet
> format.
> Any strong reasons to choose Parquet over the ORC file format?
> Also: our current data compression is bzip2.
> This seems biased.
