spark-user mailing list archives

From "Md. Rezaul Karim" <rezaul.ka...@insight-centre.org>
Subject Writing a DataFrame is taking too long and huge space
Date Fri, 09 Mar 2018 10:23:41 GMT
Dear All,

I have a tiny CSV file, around 250 MB, and the resulting DataFrame has only
30 columns. Now I'm trying to save the pre-processed DataFrame as another
CSV file on disk for later use.

However, writing the resultant DataFrame is taking far too long, about 4 to
5 hours. Worse still, the file written to disk is about 58 GB!
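For context, here is a quick sanity check (a minimal sketch, assuming myDF
is the pre-processed DataFrame) to rule out the data itself having grown
during preprocessing:

print(myDF.count())                  # row count after preprocessing
print(len(myDF.columns))             # column count after preprocessing
print(myDF.rdd.getNumPartitions())   # partitions Spark would write in parallel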

Here's the sample code that I tried:

# Attempt 1: repartition(1) shuffles all rows into a single partition
myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")

# Attempt 2: coalesce(1) merges existing partitions without a full shuffle
myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
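
A minimal sketch of an alternative, assuming Spark 2.x (where CSV support
is built in, so the external com.databricks.spark.csv package is no longer
needed) and that gzip-compressed output is acceptable downstream:

# Dropping repartition(1)/coalesce(1) lets every executor write its own
# part file in parallel instead of funnelling all rows through one task;
# compression cuts the on-disk size.
myDF.write.option("header", "true").option("compression", "gzip").csv("data/file.csv")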


Any better suggestions?



----
Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

Email: rezaul.karim@fit.fraunhofer.de
Tel: +49 241 80-21527
