spark-user mailing list archives

From Ofir Manor <ofir.ma...@equalum.io>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Thu, 28 Jul 2016 17:46:48 GMT
BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
personally think both are great at this point).
But the original question was about Spark 2.0. Does anyone have insights
into Parquet-specific optimizations / limitations vs. ORC-specific
optimizations / limitations in pre-2.0 vs. 2.0? I put one at the
beginning of the thread regarding Structured Streaming, but there was a
general claim that pre-2.0 Spark was missing many ORC optimizations, and
that some (all?) were added in 2.0.
I saw that a lot of related tickets were closed in 2.0, but it would be
great if someone close to the details could explain.
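A hedged pointer for anyone testing this themselves: these Spark SQL configuration keys control predicate pushdown for the two formats. Their defaults have varied across Spark releases, so verify them against your version's documentation before drawing conclusions.

```properties
# Spark SQL keys relevant to the ORC vs. Parquet comparison; defaults
# differ between pre-2.0 and 2.0 releases, so check your version.
spark.sql.orc.filterPushdown=true
spark.sql.parquet.filterPushdown=true
```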

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io

On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Like anything else, your mileage varies.
>
> ORC with Vectorised query execution
> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution> is
> the nearest one can get to a proper data warehouse like SAP IQ or Teradata
> with columnar indexes. To me that is cool. Parquet has been around and has
> its use cases as well.
>
> I guess there is no hard and fast rule about which one to use all the time.
> Use the one that best fits the situation.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 09:18, Jörn Franke <jornfranke@gmail.com> wrote:
>
>> I see it more as a process of innovation, and thus competition is good.
>> Companies should not follow these religious arguments but should try for
>> themselves what suits them. There is more to using software than the
>> software itself ;)
>>
>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> And frankly this is becoming some sort of religious arguments now
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sbpothineni@gmail.com>
>> wrote:
>>
>>> It depends on what you are doing. Here is a recent comparison of ORC and
>>> Parquet:
>>>
>>>
>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> Although it is from the ORC authors, I thought it was a fair comparison. We
>>> use ORC as the system of record on our Cloudera HDFS cluster, and our
>>> experience so far is good.
>>>
>>> Parquet is backed by Cloudera, which has more Hadoop installations; ORC is
>>> backed by Hortonworks. So the battle of the file formats continues...
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhanp22@gmail.com>
>>> wrote:
>>>
>>> It seems like the Parquet format is better than ORC when the dataset is log
>>> data without nested structures? Is that a fair understanding?
>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>>>
>>>> Kudu has, from my impression, been designed to offer something between
>>>> HBase and Parquet for write-intensive loads - it is not faster than Parquet
>>>> for warehouse-type querying (rather slower, because that is not its use
>>>> case). I assume this is still its strategy.
>>>>
>>>> For some scenarios it could make sense together with Parquet and ORC.
>>>> However, I am not sure what the advantage would be over using HBase plus
>>>> Parquet or ORC.
>>>>
>>>> On 27 Jul 2016, at 11:47, Uwe Moosheimer <Uwe@moosheimer.com> wrote:
>>>>
>>>> Hi Gourav,
>>>>
>>>> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an
>>>> in-memory db with data storage, while Parquet is "only" a columnar storage
>>>> format.
>>>>
>>>> As I understand it, Kudu is a BI db meant to compete with Exasol or Hana
>>>> (ok ... that's more a wish :-).
>>>>
>>>> Regards,
>>>> Uwe
>>>>
>>>> Best regards,
>>>> Kay-Uwe Moosheimer
>>>>
>>>> On 27 Jul 2016, at 09:15, Gourav Sengupta <gourav.sengupta@gmail.com>
>>>> wrote:
>>>>
>>>> Gosh,
>>>>
>>>> whether ORC came from this or that, it runs queries in Hive with Tez at a
>>>> speed that is better than Spark's.
>>>>
>>>> Has anyone heard of Kudu? It's better than Parquet. But I think that
>>>> someone might just start saying that Kudu has a difficult lineage as well.
>>>> After all, dynastic rules dictate.
>>>>
>>>> Personally, I feel that if something stores my data compressed and lets me
>>>> access it faster, I do not care where it comes from or how difficult the
>>>> childbirth was :)
>>>>
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>>> sbpothineni@gmail.com> wrote:
>>>>
>>>>> Just a correction:
>>>>>
>>>>> The ORC Java libraries from Hive have been forked into Apache ORC, with
>>>>> vectorization on by default.
>>>>>
>>>>> Does anyone know if Spark is leveraging this new repo?
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.apache.orc</groupId>
>>>>>     <artifactId>orc</artifactId>
>>>>>     <version>1.1.2</version>
>>>>>     <type>pom</type>
>>>>> </dependency>
>>>>>
>>>>> Sent from my iPhone
>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>
>>>>> parquet was inspired by dremel but written from the ground up as a
>>>>> library, with support for a variety of big data systems (hive, pig,
>>>>> impala, cascading, etc.). it is also easy to add new support, since it's a
>>>>> proper library.
>>>>>
>>>>> orc has been enhanced while deployed at facebook in hive and at yahoo in
>>>>> hive. just hive. it didn't really exist by itself. it was part of the big
>>>>> java soup that is called hive, without an easy way to extract it. hive
>>>>> does not expose proper java apis. it never cared for that.
>>>>>
>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>>>
>>>>>> Interesting opinion, thank you
>>>>>>
>>>>>> Still, per the websites, Parquet is basically inspired by Dremel
>>>>>> (Google) [1], and parts of ORC have been enhanced while deployed at
>>>>>> Facebook and Yahoo [2].
>>>>>>
>>>>>> Other than this presentation [3], do you guys know any other
>>>>>> benchmark?
>>>>>>
>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>> [2]https://orc.apache.org/docs/
>>>>>> [3]
>>>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>
>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>
>>>>>> when parquet came out it was developed by a community of companies,
>>>>>> and was designed as a library to be supported by multiple big data
>>>>>> projects. nice
>>>>>>
>>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>>> designed as a library that could be re-used. even today it brings in the
>>>>>> kitchen sink of transitive dependencies. yikes
>>>>>>
>>>>>> On Jul 26, 2016, 5:09 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>>>>>>
>>>>>>> I think both are very similar, but with slightly different goals. While
>>>>>>> they work transparently for any Hadoop application, you need to enable
>>>>>>> specific support in the application for predicate pushdown. In the end
>>>>>>> you have to check which application you are using and do some tests
>>>>>>> (with a correct predicate pushdown configuration). Keep in mind that
>>>>>>> both formats work best if they are sorted on the filter columns (which
>>>>>>> is your responsibility) and if their optimizations are correctly
>>>>>>> configured (min/max indexes, bloom filters, compression, etc.).
>>>>>>>
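To illustrate the point about sorting on filter columns: a toy Python sketch (not the actual ORC or Parquet implementation; all names here are illustrative) of how per-stripe min/max statistics let a reader skip whole stripes when the data is sorted on the predicate column.

```python
import random

def build_stripes(rows, stripe_size):
    """Split rows into stripes, recording min/max statistics per stripe."""
    stripes = []
    for i in range(0, len(rows), stripe_size):
        chunk = rows[i:i + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes

def scan(stripes, lo, hi):
    """Return rows in [lo, hi], reading only stripes whose range overlaps."""
    hits, stripes_read = [], 0
    for s in stripes:
        if s["max"] < lo or s["min"] > hi:
            continue  # stripe skipped entirely via the min/max index
        stripes_read += 1
        hits.extend(r for r in s["rows"] if lo <= r <= hi)
    return hits, stripes_read

# Sorted on the filter column: the predicate touches a single stripe.
sorted_rows = list(range(1000))
hits, read = scan(build_stripes(sorted_rows, 100), 250, 260)   # read == 1

# Unsorted: values scatter across stripes, so nearly every stripe's
# min/max range overlaps the predicate and little can be skipped.
random.seed(0)
shuffled = sorted_rows[:]
random.shuffle(shuffled)
hits2, read2 = scan(build_stripes(shuffled, 100), 250, 260)    # read2 == 10
```

The same mechanism underlies the advice that both formats reward sorting: statistics only prune data when values are clustered within stripes.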
>>>>>>> If you need to ingest sensor data, you may want to store it first in
>>>>>>> HBase and then batch-process it into large files in ORC or Parquet
>>>>>>> format.
>>>>>>>
>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Just wondering about the advantages and disadvantages of converting
>>>>>>> data into ORC or Parquet.
>>>>>>>
>>>>>>> In the Spark documentation there are numerous examples of the Parquet
>>>>>>> format.
>>>>>>>
>>>>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>
>>>>>>> Also: the current data compression is bzip2.
>>>>>>>
>>>>>>>
>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>> This seems biased, though.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
