spark-user mailing list archives

From Stephen Joung <step...@vcnc.co.kr>
Subject Re: parquet vs orc files
Date Thu, 22 Feb 2018 00:37:11 GMT
For Parquet, the best source I found for configuring and verifying "min/max
statistics" was

https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide
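
To give a rough idea of how min/max statistics help, here is a toy sketch in plain Python. It is not Spark or Parquet code: `RowGroup`, `build_row_groups` and `scan_equals` are hypothetical names made up for illustration, and the group size of 100 is arbitrary. The idea it models is that each chunk of a file stores the minimum and maximum of a column, so a reader can skip any chunk whose range cannot contain the filter value. It also suggests why sorting on the filter column matters: sorted data gives narrow, non-overlapping ranges.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics kept for one column."""
    values: List[int]
    min_value: int
    max_value: int


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_equals(groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Evaluate `col == target`; return the matches and how many groups were read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Skip the group when its statistics prove the value cannot be inside.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


random.seed(0)
data = list(range(1000))
shuffled = data[:]
random.shuffle(shuffled)

# Same values, same group size; only the write order differs.
sorted_groups = build_row_groups(data, group_size=100)
shuffled_groups = build_row_groups(shuffled, group_size=100)

sorted_hits, sorted_read = scan_equals(sorted_groups, 437)
shuffled_hits, shuffled_read = scan_equals(shuffled_groups, 437)

print("sorted:  ", sorted_read, "of", len(sorted_groups), "groups read")
print("shuffled:", shuffled_read, "of", len(shuffled_groups), "groups read")
```

With sorted input only the one group covering 400 to 499 can contain 437, so a single group is read. With shuffled input almost every group's range spans the value, so the statistics rule out very little.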

---

I don't have any experience with ORC.

On Thu, Feb 22, 2018 at 6:59 AM, Kane Kim <kane.isturm@gmail.com> wrote:

> Thanks, how does min/max index work? Can spark itself configure bloom
> filters when saving as orc?
>
> On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>
>> In the latest version both are equally well supported.
>>
>> You need to insert the data sorted on the filtering columns.
>> Then you will benefit from min/max indexes and, in the case of ORC,
>> additionally from bloom filters, if you configure them.
>> In any case I also recommend partitioning the files (not to be confused
>> with Spark partitioning).
>>
>> Which is best for you, you will have to figure out in a test. This highly
>> depends on the data and the analysis you want to do.
>>
>> > On 21. Feb 2018, at 21:54, Kane Kim <kane.isturm@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > Which format is better supported in spark, parquet or orc?
>> > Will spark use internal sorting of parquet/orc files (and how to test
>> > that)?
>> > Can spark save sorted parquet/orc files?
>> >
>> > Thanks!
>>
>
>
