Hi Lian,

Many thanks for the detailed information and for sharing the solution with us. I will forward this to my student and hopefully it will resolve the issue.

Best regards,

On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang <jiangok2006@gmail.com> wrote:
Hi Gezim,

The execution plan of the DataFrame I write to HDFS is a union of 140 child DataFrames, none of which are materialized before the write is triggered. So it is not saving the file that takes time; it is materializing the DataFrames. My solution is to materialize each child DataFrame and save it to HDFS first. Then unioning the pre-materialized child DataFrames and saving the result to HDFS is very fast.
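A minimal sketch of the idea in PySpark (the staging format, the paths, and the `children` list are placeholders of mine, not our exact code):

from functools import reduce

staged = []
for i, child in enumerate(children):
    path = 'hdfs:///tmp/staged/child_%d' % i
    child.write.mode('overwrite').parquet(path)  # force materialization of this child
    staged.append(spark.read.parquet(path))      # read back the already-saved data

# Unioning pre-materialized DataFrames makes the final write cheap.
merged = reduce(lambda a, b: a.union(b), staged)
merged.write.mode('overwrite').parquet('hdfs:///tmp/merged')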

Hope this helps!

On Tue, Mar 26, 2019 at 1:50 PM Gezim Sejdiu <g.sejdiu@gmail.com> wrote:
Hi Lian,

I have been following the thread since one of my students had the same issue. The problem occurred when trying to save a larger XML dataset into HDFS: due to a connectivity timeout between Spark and HDFS, the output was never produced.
I also suggested that he do what @Apostolos said in the previous mail, i.e. use saveAsTextFile instead (I haven't gotten any result/reply since my suggestion).
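Roughly something along these lines (just a sketch; the output path is a placeholder):

df.rdd.map(lambda row: ','.join(str(c) for c in row)).saveAsTextFile('hdfs:///tmp/output')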

Seeing that the last commit to the databricks/spark-csv [1] project was made on Jan 10, 2017, I am not sure how well it keeps in line with Spark 2.x, even though there is a note about this in the README file.
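For what it's worth, the CSV data source has been built into Spark since 2.0, so the external package should no longer be necessary; something along these lines should work (the path is a placeholder):

df.repartition(1).write.format('csv').mode('overwrite').option('header', 'true').save('hdfs:///tmp/output')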

Would it be possible for you to share your solution with us (in case the project is already open-sourced) so that we can have a look at it?

Many thanks in advance.

Best regards,

On Tue, Mar 26, 2019 at 1:09 AM Lian Jiang <jiangok2006@gmail.com> wrote:
Thanks, guys, for the replies.

The execution plan shows a giant query. After dividing and conquering it, saving is quick.

On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli168@gmail.com> wrote:
Hi Lian,
Since you are using repartition(1), do you want to decrease the number of partitions? If so, have you tried using coalesce instead?
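For example (just a sketch based on your snippet; coalesce(1) avoids the full shuffle that repartition(1) triggers):

df.coalesce(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)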

Kathleen

On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2006@gmail.com> wrote:
Hi,

Writing a CSV file to HDFS takes about 1 hour:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)

The generated CSV file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB of memory).

Other people have reported similar issues, but I haven't seen a good explanation or solution.

Any clue is highly appreciated! Thanks.




--
_____________
Gëzim Sejdiu
PhD Student & Research Associate
SDA, University of Bonn
Endenicher Allee 19a, 53115 Bonn, Germany
https://gezimsejdiu.github.io/
GitHub | Twitter | LinkedIn | Google Scholar


