spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Tue, 26 Jul 2016 13:19:53 GMT
when parquet came out it was developed by a community of companies, and was
designed as a library to be supported by multiple big data projects. nice

orc on the other hand initially only supported hive. it wasn't even
designed as a library that can be re-used. even today it brings in the
kitchen sink of transitive dependencies. yikes

On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:

> I think both are very similar, but with slightly different goals. While
> they work transparently for each Hadoop application you need to enable
> specific support in the application for predicate push down.
> In the end you have to check which application you are using and do some
> tests (with correct predicate push down configuration). Keep in mind that
> both formats work best if they are sorted on filter columns (which is your
> responsibility) and if their optimizations are correctly configured (min/max
> index, bloom filter, compression, etc.).
>
> If you need to ingest sensor data you may want to store it first in hbase
> and then batch process it in large files in Orc or parquet format.
>
> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com> wrote:
>
> Just wondering about the advantages and disadvantages of converting data
> into ORC or Parquet.
>
> In the documentation of Spark there are numerous examples of Parquet
> format.
>
> Any strong reasons to choose Parquet over ORC file format?
>
> Also: current data compression is bzip2
>
>
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
> This seems biased.
>
>
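
The point quoted above, that both formats work best when the data is sorted on the filter columns, can be illustrated with a toy model of a min/max index. This is only a sketch under stated assumptions: `chunk_stats` and `chunks_to_read` are hypothetical helper names, the chunk size and data are made up, and neither ORC nor Parquet is claimed to implement its statistics this way. It only shows why a reader that keeps per-chunk min/max values can skip more chunks when the column is sorted.

```python
from typing import List, Sequence, Tuple


def chunk_stats(values: Sequence[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Return a (min, max) pair for each consecutive chunk of `values`.

    Stands in for the per-chunk statistics a columnar file might keep
    (hypothetical helper, not a real ORC/Parquet API).
    """
    stats = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def chunks_to_read(stats: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count chunks whose [min, max] range overlaps the filter lo <= v <= hi.

    Chunks that do not overlap can be skipped without reading their rows.
    """
    return sum(1 for cmin, cmax in stats if cmin <= hi and cmax >= lo)


# Made-up column: the integers 0..99 in a scrambled order.
values = [(i * 37) % 100 for i in range(100)]

# Same filter (20 <= v <= 29) against unsorted and sorted layouts.
unsorted_hits = chunks_to_read(chunk_stats(values, 10), 20, 29)
sorted_hits = chunks_to_read(chunk_stats(sorted(values), 10), 20, 29)

print("chunks read, unsorted:", unsorted_hits)
print("chunks read, sorted:  ", sorted_hits)
```

With the sorted layout each chunk covers a narrow, non-overlapping value range, so the filter touches far fewer chunks. In the unsorted layout nearly every chunk spans most of the value range and the statistics prune very little.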
