spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: trying to understand job cancellation
Date Wed, 05 Mar 2014 23:47:41 GMT
You can also randomize job group IDs to protect yourself against unintended
termination.
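
A toy sketch of the race discussed in this thread, in Python. It is not Spark's scheduler: `run_event_loop`, `new_group_id`, and the event tuples are hypothetical names made up for illustration. It only assumes what the thread describes, namely that a group cancellation is processed as an event and hits every job registered under that group ID when it is handled. With a shared group ID, a later job that arrives ahead of the cancellation event is cancelled too; with a fresh random ID per job, it is not.

```python
import uuid
from collections import deque


def run_event_loop(events):
    """Process events in arrival order and return the names of cancelled jobs."""
    active = {}        # job name -> group id of jobs currently registered
    cancelled = set()
    queue = deque(events)
    while queue:
        kind, payload = queue.popleft()
        if kind == "submit":
            name, group = payload
            active[name] = group
        elif kind == "cancel_group":
            # Cancel every job that is registered under this group right now.
            for name, group in list(active.items()):
                if group == payload:
                    cancelled.add(name)
                    del active[name]
    return cancelled


def new_group_id():
    """Return a fresh random group ID (hypothetical helper)."""
    return str(uuid.uuid4())


# Case 1: both jobs share one group ID, and job2 beats the cancellation event.
shared = "my-group"
cancelled_shared = run_event_loop([
    ("submit", ("job1", shared)),
    ("submit", ("job2", shared)),   # meant to run after the cancel
    ("cancel_group", shared),       # cancellation meant for job1 only
])

# Case 2: each job gets its own random group ID.
g1, g2 = new_group_id(), new_group_id()
cancelled_unique = run_event_loop([
    ("submit", ("job1", g1)),
    ("submit", ("job2", g2)),
    ("cancel_group", g1),
])

print(sorted(cancelled_shared), sorted(cancelled_unique))
```

In a real application the equivalent would be to pass a newly generated ID to SparkContext.setJobGroup before each job and hand that same ID to SparkContext.cancelJobGroup. This only avoids cancelling the wrong job; it does not address the stuck stage reported at the start of the thread.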

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <koert@tresata.com> wrote:

> got it. seems like i better stay away from this feature for now..
>
>
> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>
>> One issue is that job cancellation is posted on the event loop, so it's
>> possible that jobs submitted to the job queue afterwards may beat the
>> cancellation event, and hence the cancellation event may end up cancelling
>> them too.
>> So there's definitely a race condition you are risking, even if you are not
>> running into it.
>>
>>
>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> SparkContext.cancelJobGroup
>>>
>>>
>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>
>>>> How do you cancel the job? Which API do you use?
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>>> i also noticed that jobs (with a new JobGroupId) which i run after
>>>>> this and which use the same RDDs get very confused. i see lots of cancelled
>>>>> stages and retries that go on forever.
>>>>>
>>>>>
>>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>
>>>>>> i have a running job that i cancel while keeping the spark context
>>>>>> alive.
>>>>>>
>>>>>> at the time of cancellation the active stage is 14.
>>>>>>
>>>>>> i see in logs:
>>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>>>
>>>>>> so far it all looks good. then i get a lot of messages like this:
>>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>>>
>>>>>> after this stage 14 hangs around in active stages, without any sign
>>>>>> of progress or cancellation. it just sits there forever, stuck. looking at
>>>>>> the logs of the executors confirms this. the tasks seem to be still
>>>>>> running, but nothing is happening. for example (by the time i look at this
>>>>>> it's 4:58 so this task hasn't done anything in 15 mins):
>>>>>>
>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>
>>>>>> not sure what to make of this. any suggestions? best, koert
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
