spark-user mailing list archives

From janardhan shetty <janardhan...@gmail.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Tue, 26 Jul 2016 03:31:36 GMT
Thanks, Timur, for the explanation.
What if the data is log data stored in delimited files (CSV or TSV)
without much nesting?

On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao <tsh@timshenkao.su> wrote:

> 1) The opinions on StackOverflow are correct, not biased.
> 2) Cloudera promoted Parquet; Hortonworks promoted ORC + Tez. When it became
> obvious that a file format alone is not enough and Impala sucks, Cloudera
> announced https://vision.cloudera.com/one-platform/ and focused on Spark.
> 3) There is a race between ORC & Parquet: after some release ORC becomes
> better & faster, then, several months later, Parquet may outperform it.
> 4) If you use "flat" tables --> ORC is better. If you have highly nested
> data with arrays inside of dictionaries (for instance, JSON that isn't
> flattened), then maybe one should choose Parquet.
> 5) AFAIK, Parquet has its metadata at the end of the file (correct me if
> something has changed). It means that a Parquet file must be completely read
> & put into RAM. If there is not enough RAM or the file is somehow corrupted
> --> problems arise.
>
> On Tue, Jul 26, 2016 at 5:09 AM, janardhan shetty <janardhanp22@gmail.com>
> wrote:
>
>> Just wondering about the advantages and disadvantages of converting data
>> into ORC or Parquet.
>>
>> In the documentation of Spark there are numerous examples of Parquet
>> format.
>>
>> Any strong reasons to choose Parquet over the ORC file format?
>>
>> Also: the current data compression is bzip2.
>>
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.
>>
>
>
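
On point 5 in the quoted message: as far as I know, the Parquet layout puts a
4-byte "PAR1" magic at the start and end of the file, with a 4-byte
little-endian footer length just before the trailing magic. The sketch below
only illustrates that framing on a made-up byte buffer. It suggests a reader
can find the metadata by seeking to the tail instead of reading everything,
though how much a given engine actually loads is up to that engine. The names
locate_footer and fake_parquet and the footer contents are hypothetical, and
this is not a real Parquet reader.

```python
import io
import struct

MAGIC = b"PAR1"  # 4-byte magic at both the start and the end of a Parquet file


def locate_footer(f):
    """Return (offset, length) of the footer metadata of a seekable file.

    Only the 8-byte tail is read: a 4-byte little-endian footer length
    followed by the 4-byte trailing magic.
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    if file_size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")
    f.seek(file_size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic (truncated or corrupted?)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    footer_start = file_size - 8 - footer_len
    if footer_start < len(MAGIC):
        raise ValueError("footer length is larger than the file")
    return footer_start, footer_len


def fake_parquet(data_bytes, footer_bytes):
    """Build a stand-in buffer with the same framing (NOT real Parquet content)."""
    return (MAGIC + data_bytes + footer_bytes
            + struct.pack("<I", len(footer_bytes)) + MAGIC)


# Hypothetical demo: 1000 bytes of "row group" data plus a placeholder footer.
demo = io.BytesIO(fake_parquet(b"\x00" * 1000, b"hypothetical-metadata"))
offset, length = locate_footer(demo)
demo.seek(offset)
footer = demo.read(length)  # only the footer bytes are read here
```

A truncated file fails the trailing-magic check right away, which fits the
remark about corrupted files: without the footer the rest of the file is hard
to interpret.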
