spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: trying to understand job cancellation
Date Wed, 05 Mar 2014 22:55:42 GMT
One issue is that job cancellation is posted to the event loop. So it's
possible that jobs submitted to the job queue afterwards beat the job
cancellation event, and the cancellation event then ends up cancelling
them too. So there's definitely a race condition you are risking, even if
you are not running into it.
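
To make the ordering problem concrete, here is a minimal toy model in Python. It is not Spark's scheduler code: the event names, the `run_event_loop` helper and the job/group ids are all made up for illustration. It assumes that a cancel event cancels whatever jobs are registered under that group at the moment the event is processed.

```python
from collections import deque


def run_event_loop(events):
    """Toy single-threaded event loop (hypothetical, not Spark's code).

    Processes events in arrival order and returns a tuple of
    (sorted active job ids, cancelled job ids in cancellation order).
    """
    active = {}      # job_id -> group_id for jobs currently registered
    cancelled = []   # job ids removed by a cancel_group event
    queue = deque(events)
    while queue:
        kind, payload = queue.popleft()
        if kind == "submit":
            job_id, group_id = payload
            active[job_id] = group_id
        elif kind == "cancel_group":
            # Cancels every job registered under the group *right now*,
            # whether or not the caller meant to include it.
            for job_id in [j for j, g in active.items() if g == payload]:
                del active[job_id]
                cancelled.append(job_id)
    return sorted(active), cancelled


# Intended order: the cancel is processed before the follow-up job arrives.
intended = run_event_loop([
    ("submit", ("job-1", "group-a")),
    ("cancel_group", "group-a"),
    ("submit", ("job-2", "group-a")),
])

# Racy order: the follow-up submission beats the cancel event.
racy = run_event_loop([
    ("submit", ("job-1", "group-a")),
    ("submit", ("job-2", "group-a")),
    ("cancel_group", "group-a"),
])

print(intended)  # job-2 survives
print(racy)      # job-2 is cancelled together with job-1
```

In this sketch a job submitted under a different group id would not be matched by the cancel event. The model says nothing about stages or RDDs shared between jobs.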

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <koert@tresata.com> wrote:

> SparkContext.cancelJobGroup
>
>
> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>
>> How do you cancel the job? Which API do you use?
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> i also noticed that jobs (with a new JobGroupId) which i run after this
>>> and which use the same RDDs get very confused. i see lots of cancelled
>>> stages and retries that go on forever.
>>>
>>>
>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>
>>>> i have a running job that i cancel while keeping the spark context
>>>> alive.
>>>>
>>>> at the time of cancellation the active stage is 14.
>>>>
>>>> i see in logs:
>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job
>>>> group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>> 10
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>> 14
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
>>>> cancelled
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet
>>>> 14.0 from pool x
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>> 13
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>> 12
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>> 11
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>> 15
>>>>
>>>> so far it all looks good. then i get a lot of messages like this:
>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update
>>>> with state FINISHED from TID 883 because its task set is gone
>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update
>>>> with state KILLED from TID 888 because its task set is gone
>>>>
>>>> after this stage 14 hangs around in active stages, without any sign of
>>>> progress or cancellation. it just sits there forever, stuck. looking at the
>>>> logs of the executors confirms this. the tasks seem to be still running,
>>>> but nothing is happening. for example (by the time i look at this it's 4:58
>>>> so this task hasn't done anything in 15 mins):
>>>>
>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is
>>>> 1007
>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to
>>>> driver
>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is
>>>> 1007
>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to
>>>> driver
>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>
>>>> not sure what to make of this. any suggestions? best, koert
>>>>
>>>
>>>
>>
>
