spark-user mailing list archives

From Vadim Semenov <va...@datadoghq.com>
Subject Re: Writing a DataFrame is taking too long and huge space
Date Fri, 09 Mar 2018 15:04:41 GMT
But overall, I think the original approach is not correct.
If a single output file ends up in the tens of gigabytes, the approach
probably needs to be reworked.

I don't see why you can't just write multiple CSV files using Spark, and
then concatenate them without Spark.
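To sketch that "concatenate without Spark" step: Spark writes a directory of part files, which plain shell tools can then join into a single file. The paths below are placeholders, not from the original thread:

```shell
# Spark writes a directory of part files, e.g. via df.write.csv("data/out").
# Merge the parts into one CSV without involving Spark at all:
mkdir -p data
cat data/out/part-* > data/file.csv
```

Note this assumes the part files carry no header row (or that repeated headers are acceptable); if Spark was told to write headers, you would need to strip them from every part except the first.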

On Fri, Mar 9, 2018 at 10:02 AM, Vadim Semenov <vadim@datadoghq.com> wrote:

> You can use `.checkpoint` for that
>
> `df.sort(…).coalesce(1).write...` — `coalesce` forces `sort` to run with
> only one partition, so the sort will take a long time
>
> `df.sort(…).repartition(1).write...` — `repartition` adds an explicit
> shuffle stage, but the sort order is lost, since it's a repartition
>
> ```
> sc.setCheckpointDir("/tmp/test")
> val checkpointedDf = df.sort(…).checkpoint(eager=true) // will save all
> partitions
> checkpointedDf.coalesce(1).write.csv(…) // will load checkpointed
> partitions in one task, concatenate them, and will write them out as a
> single file
> ```
>
> On Fri, Mar 9, 2018 at 9:47 AM, Deepak Sharma <deepakmca05@gmail.com>
> wrote:
>
>> I would suggest repartitioning it to a reasonable number of partitions,
>> maybe 500, and saving it to some intermediate working directory.
>> Finally, read all the files from this working dir, coalesce to 1,
>> and save to the final location.
>>
>> Thanks
>> Deepak
>>
>> On Fri, Mar 9, 2018, 20:12 Vadim Semenov <vadim@datadoghq.com> wrote:
>>
>>> Because `coalesce` gets propagated further up the DAG into the last
>>> stage, your last stage only has one task.
>>>
>>> You need to break your DAG so that your expensive operations happen in
>>> a stage before the one with `.coalesce(1)`
>>>
>>> On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
>>> rezaul.karim@insight-centre.org> wrote:
>>>
>>>> Dear All,
>>>>
>>>> I have a tiny CSV file, which is around 250MB. There are only 30
>>>> columns in the DataFrame. Now I'm trying to save the pre-processed
>>>> DataFrame as another CSV file on disk for later usage.
>>>>
>>>> However, I'm frustrated because writing the resultant DataFrame is
>>>> taking too long, about 4 to 5 hours. Worse, the file written to disk
>>>> is about 58GB!
>>>>
>>>> Here's the sample code that I tried:
>>>>
>>>> # Using repartition()
>>>> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>>>
>>>> # Using coalesce()
>>>> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>>>
>>>>
>>>> Any better suggestion?
>>>>
>>>>
>>>>
>>>> ----
>>>> Md. Rezaul Karim, BSc, MSc
>>>> Research Scientist, Fraunhofer FIT, Germany
>>>>
>>>> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>>>>
>>>> eMail: rezaul.karim@fit.fraunhofer.de
>>>> Tel: +49 241 80-21527
>>>>
>>>
>>>
>>>
>>> --
>>> Sent from my iPhone
>>>
>>
>
>
>



