Hi Lian,

I have been following this thread since one of my students had the same issue. The problem occurred when trying to save a larger XML dataset into HDFS: due to a connectivity timeout between Spark and HDFS, the output could not be produced.
I also suggested that he do what @Apostolos proposed in the previous mail, i.e. use saveAsTextFile instead (I haven't received any result/reply since my suggestion).
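For reference, a minimal sketch of what I had in mind (assuming a PySpark DataFrame called df with plain string-convertible columns; the output path is a placeholder):

# Sketch only: turn each Row into a comma-separated line and write it as plain text.
# This bypasses the CSV data source entirely, so quoting/escaping must be handled manually.
header = spark.sparkContext.parallelize([",".join(df.columns)])
lines = df.rdd.map(lambda row: ",".join(str(v) for v in row))
header.union(lines).coalesce(1).saveAsTextFile("hdfs:///tmp/output_as_text")  # placeholder path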

Seeing that the last commit on the databricks/spark-csv [1] project was made on "Jan 10, 2017", I'm not sure how well it keeps in line with Spark 2.x, even though there is a note about this in the README file.
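For what it's worth, the csv data source is built into Spark 2.x, so the same write can be expressed without the external package (a sketch, reusing the df and csv path from the snippet further down in the thread):

# Spark 2.x ships the csv source natively; no spark-csv dependency is needed.
df.repartition(1).write.format('csv') \
    .mode('overwrite').option('header', 'true').save(csv)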

Would it be possible for you to share your solution with us (in case the project is already open-sourced) so that we can have a look at it?

Many thanks in advance.

Best regards,
[1] https://github.com/databricks/spark-csv

On Tue, Mar 26, 2019 at 1:09 AM Lian Jiang <jiangok2006@gmail.com> wrote:
Thanks guys for the replies.

The execution plan shows a giant query. After applying divide and conquer, saving is quick.
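Roughly, it amounts to materializing the expensive intermediate result first and then writing the small final output from it (the names and path below are illustrative, not the actual query):

# Illustrative only: cut the giant lineage by materializing an intermediate result,
# so that the final save no longer re-runs the whole query.
intermediate = big_query_df.persist()   # or checkpoint(), or write to a temporary path
intermediate.count()                    # force evaluation once
intermediate.repartition(1).write.mode('overwrite').option('header', 'true').csv(csv)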

On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli168@gmail.com> wrote:
Hi Lian,
Since you are using repartition(1), do you want to decrease the number of partitions? If so, have you tried using coalesce instead?
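For example (just a sketch using your original write options; whether it helps depends on where the time actually goes):

# coalesce(1) avoids the full shuffle that repartition(1) triggers;
# it only merges the existing partitions into one before writing.
df.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').options(header='true').save(csv)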

Kathleen

On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2006@gmail.com> wrote:
Hi,

Writing a CSV file to HDFS takes about 1 hour:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)

The generated CSV file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB of memory).

Other people have reported similar issues, but I haven't seen a good explanation or solution.

Any clue is highly appreciated! Thanks.




--

_____________

Gëzim Sejdiu

 

PhD Student & Research Associate

SDA, University of Bonn

Endenicher Allee 19a, 53115 Bonn, Germany

https://gezimsejdiu.github.io/

GitHub | Twitter | LinkedIn | Google Scholar