spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: trying to understand job cancellation
Date Wed, 05 Mar 2014 23:42:13 GMT
got it. seems like i'd better stay away from this feature for now.


On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:

> One issue is that job cancellation is posted on the event loop, so it's
> possible that subsequent jobs submitted to the job queue may beat the job
> cancellation event, and hence the cancellation event may end up cancelling
> them too. So there's definitely a race condition you are risking, even if
> you are not running into it.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> SparkContext.cancelJobGroup
>>
>>
>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>
>>> How do you cancel the job? Which API do you use?
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>
>>>> i also noticed that jobs (with a new JobGroupId) which i run after this
>>>> and which use the same RDDs get very confused. i see lots of cancelled
>>>> stages and retries that go on forever.
>>>>
>>>>
>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>>> i have a running job that i cancel while keeping the spark context
>>>>> alive.
>>>>>
>>>>> at the time of cancellation the active stage is 14.
>>>>>
>>>>> i see in logs:
>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job
>>>>> group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>>> 10
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>>> 14
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
>>>>> cancelled
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet
>>>>> 14.0 from pool x
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>>> 13
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>>> 12
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>>> 11
>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
>>>>> 15
>>>>>
>>>>> so far it all looks good. then i get a lot of messages like this:
>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update
>>>>> with state FINISHED from TID 883 because its task set is gone
>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update
>>>>> with state KILLED from TID 888 because its task set is gone
>>>>>
>>>>> after this stage 14 hangs around in active stages, without any sign of
>>>>> progress or cancellation. it just sits there forever, stuck. looking at
>>>>> the logs of the executors confirms this. the tasks seem to be still
>>>>> running, but nothing is happening. for example (by the time i look at
>>>>> this it's 4:58 so this task hasn't done anything in 15 mins):
>>>>>
>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is
>>>>> 1007
>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to
>>>>> driver
>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is
>>>>> 1007
>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to
>>>>> driver
>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>
>>>>> not sure what to make of this. any suggestions? best, koert
>>>>>
>>>>
>>>>
>>>
>>
>
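
The race described above (a cancellation that is only posted to an event loop, while later jobs get registered before that event is processed) can be sketched with a toy model. This is not Spark code and makes no claim about how the DAGScheduler is actually implemented. ToyScheduler, its methods, the job names and the group names are all hypothetical, and the model simply assumes that submission takes effect immediately while cancellation is deferred.

```python
from collections import deque


class ToyScheduler:
    """Toy model of the race described in the thread. NOT Spark's scheduler."""

    def __init__(self):
        self.pending_events = deque()  # deferred cancellation requests
        self.active_jobs = {}          # job_id -> group_id
        self.cancelled_jobs = []

    def submit(self, job_id, group_id):
        # Assumption for illustration: a submitted job becomes active at once.
        self.active_jobs[job_id] = group_id

    def cancel_job_group(self, group_id):
        # The cancellation is only posted here; nothing is cancelled yet.
        self.pending_events.append(group_id)

    def run_event_loop(self):
        # The group is resolved to jobs when the event is processed,
        # not when the cancellation was requested.
        while self.pending_events:
            group_id = self.pending_events.popleft()
            doomed = [j for j, g in self.active_jobs.items() if g == group_id]
            for job_id in doomed:
                del self.active_jobs[job_id]
                self.cancelled_jobs.append(job_id)


def demo():
    scheduler = ToyScheduler()
    scheduler.submit("job-1", "group-a")
    scheduler.cancel_job_group("group-a")  # intended for job-1 only
    scheduler.submit("job-2", "group-a")   # beats the pending cancellation
    scheduler.submit("job-3", "group-b")   # different group
    scheduler.run_event_loop()
    return scheduler.cancelled_jobs, sorted(scheduler.active_jobs)


if __name__ == "__main__":
    print(demo())
```

In this model job-2 is cancelled along with job-1 because it shares the group id and arrived before the cancellation was processed, while job-3 in a different group survives. The thread also reports trouble for later jobs that use a new JobGroupId, so the sketch covers only the race itself, not that symptom.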
