You can use `.checkpoint` for that

`df.sort(…).coalesce(1).write...`: `coalesce` forces the `sort` to run on a single partition as well, so the sorting takes a long time

`df.sort(…).repartition(1).write...`: `repartition` adds an explicit stage, but the sort order is lost, because repartitioning shuffles the rows

val checkpointedDf = df.sort(…).checkpoint(eager = true) // saves all sorted partitions (a checkpoint directory must already be set on the SparkContext)
checkpointedDf.coalesce(1).write.csv(…) // loads the checkpointed partitions in one task, concatenates them, and writes them out as a single file
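
A slightly fuller sketch of the checkpoint approach, for anyone who wants to try it end to end. The input path, output path, checkpoint directory, and the `id` sort column are all made up for illustration; the one real requirement it adds is that a checkpoint directory has to be set before calling `.checkpoint`, otherwise Spark throws an exception.

```scala
import org.apache.spark.sql.SparkSession

object SortedSingleFileWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sorted-single-file-write")
      .master("local[*]") // assumption: local run, just for illustration
      .getOrCreate()

    // checkpoint() needs a checkpoint directory to be set first.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

    // Hypothetical input file with a header row and an "id" column.
    val df = spark.read.option("header", "true").csv("data/input.csv")

    // The eager checkpoint materializes the sorted partitions right here,
    // so the sort is not pulled into the single-task stage created below.
    val checkpointedDf = df.sort("id").checkpoint(eager = true)

    // Only this last step runs as a single task.
    checkpointedDf
      .coalesce(1)
      .write
      .option("header", "true")
      .csv("data/output") // hypothetical output directory, must not exist yet

    spark.stop()
  }
}
```

On a real cluster the checkpoint directory should probably live on shared storage such as HDFS rather than a local `/tmp` path.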

On Fri, Mar 9, 2018 at 9:47 AM, Deepak Sharma <> wrote:
I would suggest repartitioning it into a reasonable number of partitions, maybe 500, and saving it to some intermediate working directory.
Finally, read all the files from this working directory, coalesce them into 1 partition, and save to the final location.
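
The two-step write suggested above could look roughly like this. It is only a sketch: the paths are hypothetical, 500 is just the ballpark figure from the suggestion, and `mode("overwrite")` is my own addition so that the example can be re-run.

```scala
import org.apache.spark.sql.SparkSession

object TwoStepWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-step-write")
      .master("local[*]") // assumption: local run, just for illustration
      .getOrCreate()

    // Hypothetical input; in practice this is the pre-processed DataFrame.
    val df = spark.read.option("header", "true").csv("data/input.csv")

    // Step 1: write the result in parallel to an intermediate directory.
    df.repartition(500)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("data/tmp_parts") // hypothetical working directory

    // Step 2: read the intermediate files back and merge them into one file.
    // The expensive work already happened in step 1, so this job only copies data.
    spark.read
      .option("header", "true")
      .csv("data/tmp_parts")
      .coalesce(1)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("data/final") // hypothetical final location

    spark.stop()
  }
}
```

Note that this variant makes no promise about row order in the final file, since the intermediate files are read back in no particular order.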


On Fri, Mar 9, 2018, 20:12 Vadim Semenov <> wrote:
That's because `coalesce` gets propagated further up the DAG within the last stage, so your last stage has only one task.

You need to break your DAG so that your expensive operations run in a previous stage, before the stage with `.coalesce(1)`.
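
One rough way to look at this yourself is to print the physical plan and the partition count with and without the `coalesce(1)`. The snippet below is a sketch on synthetic data (the `id` column comes from `spark.range`, nothing here is from the original job), and the exact plan text depends on the Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object InspectCoalescePlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-coalesce-plan")
      .master("local[*]") // assumption: local run, just for illustration
      .getOrCreate()

    // Synthetic data, only used to have something to sort.
    val df = spark.range(0L, 1000000L).toDF("id")

    val sorted = df.sort(col("id").desc)
    val sortedThenCoalesced = sorted.coalesce(1)

    // explain() prints the physical plan; compare where the coalesce
    // sits relative to the sort and its exchange.
    sorted.explain()
    sortedThenCoalesced.explain()

    // Partition counts of the final RDD in each case.
    println(s"partitions after sort: ${sorted.rdd.getNumPartitions}")
    println(s"partitions after sort + coalesce(1): ${sortedThenCoalesced.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The Spark UI shows the same thing per stage once a write is actually triggered: check the task count of the final stage.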

On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <> wrote:
Dear All,

I have a tiny CSV file, which is around 250MB. There are only 30 columns in the DataFrame. Now I'm trying to save the pre-processed DataFrame as another CSV file on disk for later use.

However, I'm getting pissed off because writing the resulting DataFrame takes far too long, about 4 to 5 hours. On top of that, the file written to disk is about 58GB!

Here's the sample code that I tried:

# Using repartition()

# Using coalesce()
myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")

Any better suggestions?

Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

Tel: +49 241 80-21527

Sent from my iPhone
