spark-user mailing list archives

From Kevin Tran <kevin...@gmail.com>
Subject Re: Spark app write too many small parquet files
Date Mon, 28 Nov 2016 12:29:11 GMT
Hi Denny,
Thank you for your input. I also use 128 MB, but the Spark app still
generates too many files, each only ~14 KB! That is why I am asking whether
anyone who has had the same issue knows a solution.

Cheers,
Kevin.

On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g.lee@gmail.com> wrote:

> Generally, yes - you should try to have larger data sizes due to the
> overhead of opening up files.  Typical guidance is between 64 MB and 1 GB;
> personally I usually stick with 128 MB to 512 MB with the default snappy
> codec compression for parquet.  A good reference is Vida Ha's presentation Data
> Storage Tips for Optimal Spark Performance
> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
>
>
> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevintvh@gmail.com> wrote:
>
>> Hi Everyone,
>> Does anyone know the best practice for writing parquet files from
>> Spark?
>>
>> My Spark app writes data to parquet, and the output directory ends up
>> with heaps of very small parquet files (such as e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet).
>> Each parquet file is only 15 KB.
>>
>> Should it instead write bigger chunks of data (such as 128 MB each)
>> across a proper number of files?
>>
>> Has anyone observed any performance changes when changing the data size
>> of each parquet file?
>>
>> Thanks,
>> Kevin.
>>
>
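
As a rough illustration of the sizing guidance quoted above, the sketch below estimates how many output files would keep each parquet file near a target size. The helper name target_file_count and the idea of estimating the total output size up front are assumptions made for this example; nothing in the thread says how the app measures its data, and the compressed size on disk can differ a lot from the in-memory size.

```python
import math

# 128 MB, the lower end of the 128 MB to 512 MB range mentioned in the thread.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=DEFAULT_TARGET_BYTES):
    """Estimate how many files keep each one at or below target_file_bytes.

    Hypothetical helper for illustration only. total_bytes is an estimate
    of the full output size that the caller has to supply.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Always write at least one file, even for empty or tiny outputs.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Example: 10,000 files of 15 KB each hold about 146 MB in total,
# so two files of up to 128 MB would be enough.
small_files_total = 10_000 * 15 * 1024
print(target_file_count(small_files_total))
```

With a DataFrame df and an output path (both assumed here), the result could then be passed to df.repartition(n).write.parquet(path), or to df.coalesce(n) when only reducing the partition count, since Spark writes one file per partition. Whether this fits depends on how the app produces its output, which the thread does not describe.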
