spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Wed, 27 Jul 2016 07:15:30 GMT
Gosh,

whether ORC came from this or that, it runs queries in HIVE with TEZ at a
speed that is better than SPARK.

Has anyone heard of KUDU? It's better than Parquet. But I think that someone
might just start saying that KUDU has a difficult lineage as well. After all,
dynastic rules dictate.

Personally I feel that if something stores my data compressed and makes me
access it faster I do not care where it comes from or how difficult the
child birth was :)


Regards,
Gourav

On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
sbpothineni@gmail.com> wrote:

> Just a correction:
>
> The ORC Java libraries from Hive have been forked into Apache ORC.
> Vectorization is the default.
>
> I do not know whether Spark is leveraging this new repo.
>
> <dependency>
>     <groupId>org.apache.orc</groupId>
>     <artifactId>orc</artifactId>
>     <version>1.1.2</version>
>     <type>pom</type>
> </dependency>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <koert@tresata.com> wrote:
>
> parquet was inspired by dremel but written from the ground up as a library
> with support for a variety of big data systems (hive, pig, impala,
> cascading, etc.). it is also easy to add new support, since it's a proper
> library.
>
> orc has been enhanced while deployed at facebook in hive and at yahoo in
> hive. just hive. it didn't really exist by itself. it was part of the big
> java soup that is called hive, without an easy way to extract it. hive does
> not expose proper java apis. it never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.marcu@inria.fr> wrote:
>
>> Interesting opinion, thank you
>>
>> Still, on the website parquet is basically inspired by Dremel (Google)
>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>> [2].
>>
>> Other than this presentation [3], do you guys know any other benchmark?
>>
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3]
>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers <koert@tresata.com> wrote:
>>
>> when parquet came out it was developed by a community of companies, and
>> was designed as a library to be supported by multiple big data projects.
>> nice
>>
>> orc on the other hand initially only supported hive. it wasn't even
>> designed as a library that can be re-used. even today it brings in the
>> kitchen sink of transitive dependencies. yikes
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>>
>>> I think both are very similar, but with slightly different goals. While
>>> they work transparently for each Hadoop application, you need to enable
>>> specific support in the application for predicate push down.
>>> In the end you have to check which application you are using and do some
>>> tests (with correct predicate push down configuration). Keep in mind that
>>> both formats work best if they are sorted on filter columns (which is your
>>> responsibility) and if their optimizations are correctly configured
>>> (min/max index, bloom filter, compression, etc.).
>>>
>>> If you need to ingest sensor data you may want to store it first in
>>> hbase and then batch process it in large files in Orc or parquet format.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com>
>>> wrote:
>>>
>>> Just wondering about the advantages and disadvantages of converting data
>>> into ORC or Parquet.
>>>
>>> In the documentation of Spark there are numerous examples of Parquet
>>> format.
>>>
>>> Any strong reasons to choose Parquet over the ORC file format?
>>>
>>> Also: the current data compression is bzip2
>>>
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems biased.
>>>
>>>
>>
>
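
Jörn's point in the thread about sorting on filter columns and min/max
indexes can be illustrated with a toy model. This is only a sketch in plain
Python, not the actual ORC or Parquet reader: the helper names (row_groups,
groups_to_read), the chunk size, and the random data are all assumptions made
for illustration. It splits a column into fixed-size chunks, keeps only each
chunk's min and max, and counts how many chunks a range filter would still
have to read.

```python
import random


def row_groups(values, size):
    """Split values into fixed-size chunks (a stand-in for stripes / row groups)."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def groups_to_read(groups, lo, hi):
    """Count chunks whose [min, max] range overlaps the filter lo <= x <= hi.

    A chunk whose range does not overlap the filter can be skipped entirely.
    """
    return sum(1 for g in groups if min(g) <= hi and max(g) >= lo)


random.seed(0)  # fixed seed so repeated runs see the same data
data = [random.randrange(1_000_000) for _ in range(100_000)]

unsorted_groups = row_groups(data, 1_000)
sorted_groups = row_groups(sorted(data), 1_000)

# Hypothetical filter covering roughly 1% of the value domain.
lo, hi = 500_000, 510_000
unsorted_reads = groups_to_read(unsorted_groups, lo, hi)
sorted_reads = groups_to_read(sorted_groups, lo, hi)

print("chunks read without sorting:", unsorted_reads)
print("chunks read after sorting:  ", sorted_reads)
```

In the unsorted case nearly every chunk spans almost the whole value range,
so the min/max statistics rule out very little. After sorting, only the few
chunks around the filter range overlap it. Real readers add further
machinery (bloom filters, compression, per-page statistics), which this
sketch deliberately leaves out.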
