spark-user mailing list archives

From Gezim Sejdiu <g.sej...@gmail.com>
Subject Re: writing a small csv to HDFS is super slow
Date Tue, 26 Mar 2019 20:50:12 GMT
Hi Lian,

I have been following this thread because one of my students ran into the
same issue. In his case the problem appeared when saving a larger XML
dataset to HDFS: because of a connectivity timeout between Spark and HDFS,
the output could not be displayed.
I suggested that he do what @Apostolos proposed in the previous mail and use
saveAsTextFile instead (I haven't received any result or reply since making
that suggestion).
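
For reference, a minimal sketch of that saveAsTextFile route in PySpark. The
helper name save_rows_as_text and its sep parameter are made up for
illustration, and the sketch assumes df is a Spark DataFrame whose values
contain no separators or newlines, since it does no quoting or escaping:

```python
def save_rows_as_text(df, path, sep=","):
    """Write each row of `df` as one delimited line under the directory `path`.

    Hypothetical helper, not taken from the thread. It does no CSV quoting
    or escaping and writes no header line.
    """
    def to_line(row):
        # A Row behaves like a tuple, so iterating over it yields the column values.
        return sep.join("" if value is None else str(value) for value in row)

    # DataFrame.rdd exposes the rows as an RDD; saveAsTextFile writes one
    # part file per partition into the directory `path`.
    df.rdd.map(to_line).saveAsTextFile(path)
```

Unlike mode('overwrite'), saveAsTextFile fails if the output directory
already exists, so the target path has to be new or removed beforehand.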

The last commit on the databricks/spark-csv [1] project is dated
"*Jan 10, 2017*", so I am not sure how well aligned it is with Spark 2.x,
even though there is a *note* about this in the README file.
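
As far as I know, Spark 2.x ships a built-in CSV data source, so the
external package should not be needed there. Below is a minimal sketch of
the same write without it. The helper name write_small_csv is made up, df is
assumed to be a Spark 2.x DataFrame, and coalesce(1) is only one option: it
avoids the shuffle that repartition(1) triggers, but it can also reduce the
parallelism of the upstream stages.

```python
def write_small_csv(df, path):
    """Write `df` as a single CSV part file using Spark 2.x's built-in CSV writer.

    Hypothetical helper, not taken from the thread. `path` is the output
    directory; Spark places the part file inside it.
    """
    (
        df.coalesce(1)              # merge into one partition without a full shuffle
        .write
        .mode("overwrite")          # replace any existing output at `path`
        .option("header", "true")   # emit column names as the first line
        .csv(path)                  # built-in CSV source, no spark-csv package
    )
```

Calling write_small_csv(df, csv) would stand in for the save(csv) call
quoted below. It does not shrink a large query plan, which Lian reports as
the actual cause of the slow write.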

Would it be possible for you to share your solution with us (in case the
project is already open-sourced) so that we can have a look at it?

Many thanks in advance.

Best regards,
[1]. https://github.com/databricks/spark-csv

On Tue, Mar 26, 2019 at 1:09 AM Lian Jiang <jiangok2006@gmail.com> wrote:

> Thanks, guys, for the replies.
>
> The execution plan shows a giant query. After divide and conquer, saving
> is quick.
>
> On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli168@gmail.com>
> wrote:
>
>> Hi Lian,
>> Since you are using repartition(1), do you want to decrease the number of
>> partitions? If so, have you tried using coalesce instead?
>>
>> Kathleen
>>
>> On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2006@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Writing a csv to HDFS takes about 1 hour:
>>>
>>>
>>> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
>>>
>>> The generated csv file is only about 150 KB. The job uses 3 containers
>>> (13 cores, 23 GB memory).
>>>
>>> Other people have similar issues, but I don't see a good explanation or
>>> solution.
>>>
>>> Any clue is highly appreciated! Thanks.
>>>
>>>
>>>

-- 

_____________

*Gëzim Sejdiu*



*PhD Student & Research Associate*

*SDA, University of Bonn*

*Endenicher Allee 19a, 53115 Bonn, Germany*

*https://gezimsejdiu.github.io/ <https://gezimsejdiu.github.io/>*

GitHub <https://github.com/GezimSejdiu> | Twitter
<https://twitter.com/Gezim_Sejdiu> | LinkedIn
<https://www.linkedin.com/in/g%C3%ABzim-sejdiu-08b1761b> | Google Scholar
<https://scholar.google.de/citations?user=Lpbwr9oAAAAJ>
