spark-user mailing list archives

From Silvio Fiorito <silvio.fior...@granturing.com>
Subject Re: Writing a DataFrame is taking too long and huge space
Date Fri, 09 Mar 2018 15:04:36 GMT
Given that you start with ~250MB but end up with 58GB, it seems like you're generating quite a bit
of data.

Whether you use coalesce or repartition, writing out 58GB with a single core is still going to
take a while.

Using Spark for the pre-processing but writing out a single file is not going to be very efficient,
since you're asking Spark to limit its parallelism, even if only for the final stage that writes
the data out.
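
For example (just a sketch, assuming Spark 2.x's built-in CSV source and your DataFrame named myDF; with the spark-csv package the format(...) call from your example works the same way), writing without coalesce keeps the final stage parallel, and downstream tools can read the whole output directory of part files:

# one part-*.csv file per partition, written in parallel
myDF.write.option("header", "true").csv("data/output_dir")

# later, Spark (or pandas, Hive, etc.) can read the directory as a single dataset
dfBack = spark.read.option("header", "true").csv("data/output_dir")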

What are you using downstream to read this file, and why does it need to be a single 58GB file?
Could you simply keep it in Spark, so the pipeline stays optimized and you avoid the data persistence
step? For example, if you're using R or Python for some downstream processing, you could make that
part of your pipeline rather than writing the data out and then reading it back in from another
system.
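
A hypothetical sketch of that (not from the original thread: "some_key" and the aggregation are placeholders, and toPandas() only makes sense if the result fits in driver memory), where the downstream Python step consumes the DataFrame directly:

summary = myDF.groupBy("some_key").count()   # assumed downstream aggregation step
pdf = summary.toPandas()                     # hand off to pandas without persisting a 58GB CSV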


From: Vadim Semenov <vadim@datadoghq.com>
Date: Friday, March 9, 2018 at 9:42 AM
To: "Md. Rezaul Karim" <rezaul.karim@insight-centre.org>
Cc: spark users <user@spark.apache.org>
Subject: Re: Writing a DataFrame is taking too long and huge space

Because `coalesce` gets propagated further up the DAG, your last stage ends up with only one task.

You need to break your DAG so that your expensive operations run in a stage before the one
with `.coalesce(1)`.

On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <rezaul.karim@insight-centre.org> wrote:
Dear All,
I have a tiny CSV file, which is around 250MB. There are only 30 columns in the DataFrame.
Now I'm trying to save the pre-processed DataFrame as another CSV file on disk for later
use.
However, I'm getting frustrated because writing the resultant DataFrame is taking too long,
about 4 to 5 hours. On top of that, the file written to disk is about 58GB!

Here's the sample code that I tried:
# Using repartition()
myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")

# Using coalesce()
myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")

Any better suggestion?




----
Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

eMail: rezaul.karim@fit.fraunhofer.de
Tel: +49 241 80-21527



--
Sent from my iPhone