spark-user mailing list archives

From cht liu <>
Subject Re: Huge partitioning job takes longer to close after all tasks finished
Date Wed, 08 Mar 2017 03:47:52 GMT
Did you enable Spark's fault-tolerance (checkpointing) mechanism? When RDD checkpointing is enabled, Spark launches a separate job at the end of the main job to write the checkpoint data to the file system, so the data is persisted for high availability.
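To illustrate the behavior described above, here is a minimal sketch (the app name and checkpoint path are hypothetical) showing how enabling RDD checkpointing causes Spark to run an extra job after the first action completes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoints")  // hypothetical checkpoint location

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.checkpoint()  // marks the RDD for checkpointing; nothing runs yet
rdd.count()       // runs the main job, then a separate job that
                  // recomputes the RDD and writes the checkpoint files
```

If that second job is running, it shows up in the Spark UI as an additional job after the main one finishes.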

2017-03-08 2:45 GMT+08:00 Swapnil Shinde <>:

> Hello all
>    I have a spark job that reads parquet data and partition it based on
> one of the columns. I made sure partitions equally distributed and not
> skewed. My code looks like this -
> datasetA.write.partitonBy("column1").parquet(outputPath)
> Execution plan -
> [image: Inline image 1]
> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins
> to close the application. I am not sure what Spark is doing after all
> tasks are processed successfully.
> I checked the thread dump (using the UI executor tab) on a few executors
> but couldn't find anything major. Overall, a few shuffle-client threads
> are "RUNNABLE" and a few dispatched-* threads are "WAITING".
> Please let me know what Spark is doing at this stage (after all tasks
> have finished) and whether there is any way I can optimize it.
> Thanks
> Swapnil
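
A hedged sketch of one common mitigation for this symptom (this is an assumption, not confirmed from the thread): with `partitionBy`, every task can write one file per partition value it sees, so the job can produce a very large number of small output files, and the output-commit phase after the last task can take a long time. Repartitioning on the partition column first limits each partition value to one shuffle partition and so far fewer files. `datasetA`, `"column1"`, and `outputPath` are taken from the message above:

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so each value of column1 lands in one partition,
// then write; this typically yields one file per partition value
// instead of (num tasks) x (num partition values) files.
datasetA
  .repartition(col("column1"))
  .write
  .partitionBy("column1")
  .parquet(outputPath)
```

Fewer output files means less work for the commit phase that runs after all tasks report success, which is where the trailing 40-45 minutes may be going.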
