spark-user mailing list archives

From Deepak Sharma <deepakmc...@gmail.com>
Subject Re: Writing a DataFrame is taking too long and huge space
Date Fri, 09 Mar 2018 14:47:14 GMT
I would suggest repartitioning it to a reasonable number of partitions, maybe 500, and
saving it to an intermediate working directory.
Finally, read all the files back from this working directory, coalesce to 1, and
save to the final location.

Thanks
Deepak

On Fri, Mar 9, 2018, 20:12 Vadim Semenov <vadim@datadoghq.com> wrote:

> because `coalesce` gets propagated further up the DAG, so your last
> stage ends up with only one task.
>
> You need to break your DAG so that the expensive operations run in a
> stage before the stage with `.coalesce(1)`.
>
> On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
> rezaul.karim@insight-centre.org> wrote:
>
>> Dear All,
>>
>> I have a tiny CSV file, around 250 MB, with only 30 columns in the
>> DataFrame. Now I'm trying to save the pre-processed DataFrame as
>> another CSV file on disk for later use.
>>
>> However, I'm getting frustrated because writing the resulting DataFrame is
>> taking far too long, about 4 to 5 hours. What's more, the file written
>> to disk is about 58 GB!
>>
>> Here's the sample code that I tried:
>>
>> # Using repartition()
>> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>
>> # Using coalesce()
>> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>
>>
>> Any better suggestion?
>>
>>
>>
>> ----
>> Md. Rezaul Karim, BSc, MSc
>> Research Scientist, Fraunhofer FIT, Germany
>>
>> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>>
>> eMail: rezaul.karim@fit.fraunhofer.de
>> Tel: +49 241 80-21527
>>
>
>
>
> --
> Sent from my iPhone
>
