spark-user mailing list archives

From Swapnil Shinde <swapnilushi...@gmail.com>
Subject Huge partitioning job takes longer to close after all tasks finished
Date Tue, 07 Mar 2017 18:45:15 GMT
Hello all,
   I have a Spark job that reads Parquet data and partitions it based on one
of the columns. I made sure the partitions are equally distributed and not
skewed. My code looks like this -

datasetA.write.partitionBy("column1").parquet(outputPath)
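
For reference, here is a sketch of a variation I may try (an assumption on my
part that the slow part is writing and committing a very large number of
output files; datasetA, column1, and outputPath are the same as above):

    // Shuffle so that all rows for a given column1 value land in one task.
    // With the plain write above, each of the ~12,000 tasks can write a file
    // for every distinct column1 value it holds, so the job can end with a
    // very large number of files for the committer to move after the tasks
    // finish. Repartitioning by the partition column first keeps the file
    // count close to one file per distinct value.
    import org.apache.spark.sql.functions.col

    datasetA
      .repartition(col("column1"))
      .write
      .partitionBy("column1")
      .parquet(outputPath)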

Execution plan - [inline image: screenshot of the execution plan]

All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins for
the application to close. I am not sure what Spark is doing after all tasks
have completed successfully.
I checked thread dumps (using the executors tab in the UI) on a few executors
but couldn't find anything major. Overall, a few shuffle-client threads are
"RUNNABLE" and a few dispatcher-* threads are "WAITING".

Please let me know what Spark is doing at this stage (after all tasks have
finished) and whether there is any way I can optimize it.
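
One thing I am planning to try, in case the time is going into the output
commit phase (again an assumption on my part;
mapreduce.fileoutputcommitter.algorithm.version is a standard Hadoop setting,
not something Spark-specific):

    // Sketch, assuming the post-task delay is the v1 FileOutputCommitter
    // moving every task's files out of _temporary one by one during job
    // commit. Algorithm version 2 renames each task's files into the final
    // location at task-commit time instead, so job commit has far less to do.
    // `spark` is the SparkSession used for the job.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    datasetA.write.partitionBy("column1").parquet(outputPath)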

Thanks
Swapnil
