spark-issues mailing list archives

From "Terry Kim (Jira)" <>
Subject [jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
Date Sun, 22 Dec 2019 05:59:00 GMT


Terry Kim commented on SPARK-30316:

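This is a possible scenario. When you repartition/shuffle the data, the rows are reordered,
so values that used to sit next to each other may end up scattered. Parquet's encodings and
compression work best when similar values are stored together, so the compression ratio can
become worse after the shuffle.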
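
As a rough illustration of the effect (not Spark or Parquet itself), the sketch below compresses
the same values twice with zlib from the Python standard library: once in clustered order and once
after a random reordering that stands in for a shuffle. The column contents and the `customer_`
naming are made up for the example, and the exact byte counts will differ from what Parquet
produces.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of the values, one per line."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload))


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical column: 100 distinct keys, each repeated 1000 times, stored clustered.
clustered = [f"customer_{i:03d}" for i in range(100) for _ in range(1000)]

# Stand-in for a shuffle: identical values, but in scattered order.
scattered = clustered[:]
random.shuffle(scattered)

clustered_bytes = compressed_size(clustered)
scattered_bytes = compressed_size(scattered)

print("clustered:", clustered_bytes, "bytes")
print("scattered:", scattered_bytes, "bytes")
```

The scattered copy holds exactly the same rows but compresses to a noticeably larger size. If the
shuffle is required, one option worth trying (a suggestion, not something confirmed in this
thread) is to restore locality before writing, for example with `sortWithinPartitions` on the
columns that repeat the most.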

> data size boom after shuffle writing dataframe save as parquet
> --------------------------------------------------------------
>                 Key: SPARK-30316
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, SQL
>    Affects Versions: 2.4.4
>            Reporter: Cesc 
>            Priority: Blocker
> When I read the same parquet file and then save it in two ways, with a shuffle and without
> a shuffle, the sizes of the output parquet files are quite different. For example, given an
> original parquet file of 800 MB, saving it without a shuffle still yields 800 MB, whereas if
> I call repartition and then save it in parquet format, the data size increases to 2.5 GB.
> The row count, column count, and content of the two output files are all the same.
> I wonder:
> first, why does the data size increase after repartition/shuffle?
> second, if I need to shuffle the input dataframe, how can I save it as a parquet file
> efficiently and avoid the data size boom?
