spark-user mailing list archives

From Swapnil Shinde <>
Subject Re: Huge partitioning job takes longer to close after all tasks finished
Date Wed, 08 Mar 2017 20:00:54 GMT
Thank you, liu. Can you please explain what you mean by enabling Spark's
fault tolerance mechanism?
I observed that after all tasks finish, Spark works on concatenating the
same partitions from all tasks on the file system, e.g.:
task1 - partition1, partition2, partition3
task2 - partition1, partition2, partition3

After task1 and task2 finish, Spark concatenates partition1 from task1 and
task2 to create the final partition1. This takes longer when there is a
large number of files. I am not sure if there is a way to stop Spark from
concatenating partitions from each task.
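The rename/move step described above corresponds to Spark's file output
commit phase. A hedged sketch (not from this thread) of a configuration
that is often used to speed it up, assuming the write goes through a
Hadoop `FileOutputCommitter`:

```properties
# spark-defaults.conf -- sketch, not a verified fix for this job.
# The v2 commit algorithm moves each task's files into their final
# location at task commit, instead of the driver renaming them all
# serially at job commit. Note: v2 is faster but leaves partial output
# behind if the job fails mid-commit.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
```

Whether this helps depends on the Spark/Hadoop versions in use and on the
output file system's rename cost.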


On Tue, Mar 7, 2017 at 10:47 PM, cht liu <> wrote:

> Did you enable Spark's fault tolerance mechanism? If RDD checkpointing
> is enabled, Spark starts a separate job at the end of the main job to
> write the checkpoint data to the file system, persisting it for high
> availability.
> 2017-03-08 2:45 GMT+08:00 Swapnil Shinde <>:
>> Hello all
>>    I have a Spark job that reads parquet data and partitions it based on
>> one of the columns. I made sure the partitions are equally distributed and
>> not skewed. My code looks like this -
>> datasetA.write.partitionBy("column1").parquet(outputPath)
>> Execution plan -
>> [image: Inline image 1]
>> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins
>> to close the application. I am not sure what Spark is doing after all tasks
>> are processed successfully.
>> I checked the thread dump (using the UI executor tab) on a few executors
>> but couldn't find anything major. Overall, a few shuffle-client processes
>> are "RUNNABLE" and a few dispatched-* processes are "WAITING".
>> Please let me know what Spark is doing at this stage (after all tasks have
>> finished) and whether there is any way I can optimize it.
>> Thanks
>> Swapnil
