spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Tue, 26 Jul 2016 21:50:33 GMT
parquet was inspired by dremel but written from the ground up as a library
with support for a variety of big data systems (hive, pig, impala,
cascading, etc.). it is also easy to add new support, since it's a proper
library.

orc has been enhanced while deployed at facebook in hive and at yahoo in
hive. just hive. it didn't really exist by itself. it was part of the big
java soup that is called hive, without an easy way to extract it. hive does
not expose proper java apis. it never cared for that.

On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Interesting opinion, thank you
>
> Still, on the website parquet is basically inspired by Dremel (Google) [1]
> and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>
> Other than this presentation [3], do you guys know any other benchmark?
>
> [1]https://parquet.apache.org/documentation/latest/
> [2]https://orc.apache.org/docs/
> [3]
> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>
> On 26 Jul 2016, at 15:19, Koert Kuipers <koert@tresata.com> wrote:
>
> when parquet came out it was developed by a community of companies, and
> was designed as a library to be supported by multiple big data projects.
> nice
>
> orc on the other hand initially only supported hive. it wasn't even
> designed as a library that can be re-used. even today it brings in the
> kitchen sink of transitive dependencies. yikes
>
> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>
>> I think both are very similar, but with slightly different goals. While
>> they work transparently for each Hadoop application, you need to enable
>> specific support in the application for predicate push down.
>> In the end you have to check which application you are using and do some
>> tests (with correct predicate push down configuration). Keep in mind that
>> both formats work best if they are sorted on filter columns (which is your
>> responsibility) and if their optimizations are correctly configured
>> (min/max index, bloom filter, compression, etc.).
>>
>> If you need to ingest sensor data you may want to store it first in hbase
>> and then batch process it in large files in Orc or parquet format.
>>
>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com>
>> wrote:
>>
>> Just wondering about the advantages and disadvantages of converting data
>> into ORC or Parquet.
>>
>> In the documentation of Spark there are numerous examples of Parquet
>> format.
>>
>> Any strong reasons to choose Parquet over the ORC file format?
>>
>> Also: the current data compression is bzip2
>>
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.
>>
>>
>
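To illustrate the point quoted above about sorting on filter columns, here is a toy sketch in plain Python. It is not how Parquet or ORC are actually implemented; the RowGroup class, the helper functions and the sensor ids are all made up for illustration. It only models the general idea that a reader holding per-chunk min/max statistics can skip chunks that cannot contain the filter value, and that this skipping only pays off when the data is sorted on that column.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class RowGroup:
    """Hypothetical chunk of rows plus the min/max statistics kept for it."""
    values: List[int]
    min_value: int
    max_value: int


def write_row_groups(values: Sequence[int], group_size: int) -> List[RowGroup]:
    """Split values into fixed-size groups and record min/max for each group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = list(values[start:start + group_size])
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_equal(groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return (matching values, number of groups read) for `col == target`."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushed-down predicate: skip the group if its statistics rule out a match.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


# Made-up sensor ids, interleaved so every group spans the full value range.
unsorted_ids = [i % 100 for i in range(1000)]
sorted_ids = sorted(unsorted_ids)

hits_unsorted, read_unsorted = scan_equal(write_row_groups(unsorted_ids, 100), 42)
hits_sorted, read_sorted = scan_equal(write_row_groups(sorted_ids, 100), 42)

print("groups read without sorting:", read_unsorted)
print("groups read with sorting:", read_sorted)
```

Both scans return the same rows, but in the unsorted layout every group's min/max range covers the filter value, so nothing can be skipped. With real files the same effect should be verified by testing, as suggested in the thread.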
