spark-user mailing list archives

From Teemu Heikkilä <te...@emblica.fi>
Subject Re: Writing a DataFrame is taking too long and huge space
Date Fri, 09 Mar 2018 12:41:59 GMT
Sounds like you’re doing something other than just writing the same file back to disk. What
does your preprocessing consist of?

Sometimes you can save a lot of space by using another format, but here we’re talking about
an over 200x increase in file size (250MB in, 58GB out), so depending on the transformations
applied to the data you might not get very large savings from switching formats.

If you can give more details about what you are doing with the data we could probably help
with your task.

The slowness probably happens because Spark has to use disk while pulling all of the data into
a single partition in order to write a single file. One thing to consider is whether you can
merge the output files after the job finishes, or even pre-partition the data for its final use case.
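
As a rough sketch of the "merge the files afterwards" idea: if the DataFrame is written without
repartition(1) or coalesce(1), Spark produces a directory of part-* files, and those can be
concatenated outside Spark. The function name, its parameters and the paths in the usage note
are hypothetical and only for illustration. It assumes every part file ends with a newline, and
that with has_header=True each part starts with the same one-line header.

```python
import glob
import os
import shutil


def merge_part_files(output_dir, merged_path, has_header=False):
    """Concatenate the part-* files in output_dir into one file.

    Hypothetical helper, not part of Spark. Returns the number of
    part files that were merged.
    """
    # Marker files such as _SUCCESS and hidden .crc files do not match "part-*".
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for index, part_path in enumerate(part_paths):
            with open(part_path, "rb") as part:
                if has_header and index > 0:
                    # Drop the repeated header line of every part after the first.
                    part.readline()
                # Stream the rest of the part so large files are never held in memory.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

For example, after something like myDF.write.format("com.databricks.spark.csv").save("data/out"),
calling merge_part_files("data/out", "data/file.csv") would produce the single CSV file. This
only works when the output directory is on a filesystem the Python process can read directly.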

- Teemu

> On 9.3.2018, at 12.23, Md. Rezaul Karim <rezaul.karim@insight-centre.org> wrote:
> 
> Dear All,
> 
> I have a tiny CSV file, which is around 250MB. There are only 30 columns in the DataFrame.
> Now I'm trying to save the pre-processed DataFrame as another CSV file on disk for later
> usage.
> 
> However, I'm getting pissed off as writing the resultant DataFrame is taking too long,
> which is about 4 to 5 hours. Nevertheless, the size of the file written on the disk is about
> 58GB!
> 
> Here's the sample code that I tried:
> 
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
> 
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
> 
> 
> Any better suggestion? 
> 
> 
> 
> ---- 
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
> eMail: rezaul.karim@fit.fraunhofer.de
> Tel: +49 241 80-21527

