spark-user mailing list archives

From Gezim Sejdiu <g.sej...@gmail.com>
Subject Re: writing a small csv to HDFS is super slow
Date Wed, 27 Mar 2019 07:44:41 GMT
Hi Lian,

Many thanks for the detailed information and for sharing the solution with us.
I will forward this to my student; hopefully it will resolve the issue.

Best regards,

On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang <jiangok2006@gmail.com> wrote:

> Hi Gezim,
>
> The execution plan of the data frame I write to HDFS is a union of 140
> child data frames. None of these children are materialized before the
> write, so the time is not spent saving the file; it is spent
> materializing the data frames. My solution is to materialize each child
> data frame and save it to HDFS first. After that, unioning the already
> saved children and writing the result to HDFS is very fast.
>
> Hope this helps!
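
A minimal PySpark sketch of the approach described above, for readers who want to try it. It assumes Parquet as the intermediate format and a `base_dir` directory on HDFS; the thread states neither, and the helper names are hypothetical.

```python
from functools import reduce


def materialize(spark, child, path):
    """Write one child DataFrame to `path` and read it back.

    The DataFrame that is read back has a plan consisting of a file scan
    only, so the work that produced `child` is not repeated later.
    Parquet is an assumption made for this sketch.
    """
    child.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)


def union_materialized(spark, children, base_dir):
    """Materialize every child under `base_dir`, then union the saved copies."""
    if not children:
        raise ValueError("need at least one child DataFrame")
    saved = [
        materialize(spark, child, f"{base_dir}/child_{i:03d}")
        for i, child in enumerate(children)
    ]
    # DataFrame.union matches columns by position, so all children are
    # assumed to share the same schema.
    return reduce(lambda left, right: left.union(right), saved)
```

Calling `explain()` on the returned data frame should then show a plan made of file scans and unions instead of the original large query.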
>
> On Tue, Mar 26, 2019 at 1:50 PM Gezim Sejdiu <g.sejdiu@gmail.com> wrote:
>
>> Hi Lian,
>>
>> I have been following this thread because one of my students had the same
>> issue. The problem occurred when saving a larger XML dataset to HDFS: due
>> to a connection timeout between Spark and HDFS, the output could not be
>> displayed.
>> I also suggested that he do what @Apostolos proposed in the previous
>> mail and use saveAsTextFile instead (I have not had a result or reply
>> since my suggestion).
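
The earlier mail with the saveAsTextFile suggestion is not part of this message, so the following is only a guess at what such a write could look like in PySpark. The helper names are hypothetical, and the sketch writes no header line.

```python
import csv
import io


def row_to_csv_line(row):
    """Format one row (any sequence of values) as a single CSV line."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    # csv.writer appends a line terminator; saveAsTextFile adds its own.
    return buffer.getvalue().rstrip("\r\n")


def save_as_text_csv(df, path):
    """Write `df` as CSV text through the RDD API.

    saveAsTextFile fails if `path` already exists, so there is no
    equivalent of mode('overwrite') here.
    """
    df.rdd.map(row_to_csv_line).saveAsTextFile(path)
```

Using the `csv` module rather than a plain `",".join(...)` keeps values that contain commas or quotes intact.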
>>
>> Given that the last commit on the databricks/spark-csv [1] project is
>> dated "*Jan 10, 2017*", I am not sure how well it is in line with Spark
>> 2.x, even though there is a *note* about this in the README file.
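
For reference, Spark 2.x ships a built-in `csv` data source, so the external package is not needed there. A sketch of the write from the original question using the built-in source follows; the function name is hypothetical and `path` stands for the output location.

```python
def save_csv_builtin(df, path):
    """Write `df` as a single CSV file with a header row.

    Uses the built-in "csv" source of Spark 2.x instead of the
    com.databricks.spark.csv package.
    """
    (
        df.repartition(1)
        .write
        .format("csv")
        .mode("overwrite")
        .option("header", "true")
        .save(path)
    )
```

This only changes which CSV writer is used. It does not by itself shorten a job whose time is spent computing `df`.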
>>
>> Would it be possible for you to share your solution with us (in case the
>> project is already open-sourced) so that we can have a look at it?
>>
>> Many thanks in advance.
>>
>> Best regards,
>> [1]. https://github.com/databricks/spark-csv
>>
>> On Tue, Mar 26, 2019 at 1:09 AM Lian Jiang <jiangok2006@gmail.com> wrote:
>>
>>> Thanks, guys, for the replies.
>>>
>>> The execution plan shows a giant query. After splitting it up (divide
>>> and conquer), saving is quick.
>>>
>>> On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli168@gmail.com>
>>> wrote:
>>>
>>>> Hi Lian,
>>>> Since you are using repartition(1), do you want to decrease the number
>>>> of partitions? If so, have you tried using coalesce instead?
>>>>
>>>> Kathleen
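
To illustrate the difference raised here: `coalesce(1)` merges partitions without a shuffle, which can pull the upstream computation into a single task, while `repartition(1)` adds a shuffle so that upstream stages keep their parallelism. A small sketch with a hypothetical helper name:

```python
def to_single_partition(df, shuffle=True):
    """Return `df` reduced to one partition.

    shuffle=True  -> repartition(1): full shuffle, upstream stays parallel.
    shuffle=False -> coalesce(1): no shuffle, but the stage feeding the
                     write may run as a single task.
    """
    return df.repartition(1) if shuffle else df.coalesce(1)
```

As the later replies in this thread show, the slow part in this case was computing the data frame, not the choice between the two calls.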
>>>>
>>>> On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2006@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Writing a csv to HDFS takes about 1 hour:
>>>>>
>>>>>
>>>>> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
>>>>>
>>>>> The generated csv file is only about 150kb. The job uses 3 containers
>>>>> (13 cores, 23g mem).
>>>>>
>>>>> Other people have had similar issues, but I have not found a good
>>>>> explanation or solution.
>>>>>
>>>>> Any clue is highly appreciated! Thanks.
>>>>>
>>>>>
>>>>>
>>
>> --
>>
>> _____________
>>
>> *Gëzim Sejdiu*
>>
>>
>>
>> *PhD Student & Research Associate*
>>
>> *SDA, University of Bonn*
>>
>> *Endenicher Allee 19a, 53115 Bonn, Germany*
>>
>> *https://gezimsejdiu.github.io/ <https://gezimsejdiu.github.io/>*
>>
>> GitHub <https://github.com/GezimSejdiu> | Twitter
>> <https://twitter.com/Gezim_Sejdiu> | LinkedIn
>> <https://www.linkedin.com/in/g%C3%ABzim-sejdiu-08b1761b> | Google Scholar
>> <https://scholar.google.de/citations?user=Lpbwr9oAAAAJ>
>>
>>

