spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: trying to understand job cancellation
Date Thu, 06 Mar 2014 00:00:29 GMT
Quite unlikely, as job ids are assigned in an incremental fashion, so your
future job ids are not likely to be killed if your group id is not repeated.
I guess the issue is something else.
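
A minimal sketch of the per-job group id pattern being discussed, assuming
the standard SparkContext job-group API (setJobGroup / cancelJobGroup); the
app name and input path below are made up:

    import java.util.UUID
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("job-group-demo"))

    // tag the jobs submitted from this thread with a fresh random group id,
    // so a later cancelJobGroup call can only match the job it was meant for
    val groupId = UUID.randomUUID().toString
    sc.setJobGroup(groupId, "count words")

    val count = sc.textFile("hdfs:///some/path").count()  // hypothetical path

    // from another thread: cancels only the active jobs tagged with groupId
    sc.cancelJobGroup(groupId)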

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers <koert@tresata.com> wrote:

> i did that. my next job gets a random new job group id (a uuid). however
> that doesn't seem to stop the job from getting sucked into the cancellation.
>
>
> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>
>> You can randomize job groups as well, to secure yourself against
>> termination.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> got it. seems like i better stay away from this feature for now..
>>>
>>>
>>> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>
>>>> One issue is that job cancellation is posted on the event loop, so it's
>>>> possible that subsequent jobs submitted to the job queue may beat the job
>>>> cancellation event, and hence the cancellation event may end up closing
>>>> them too. So there's definitely a race condition you are risking, even if
>>>> you are not running into it.
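
A minimal sketch of that ordering, reusing the sc from the sketch above; the
group ids and RDD are illustrative, and countAsync comes from the implicits
in org.apache.spark.SparkContext:

    import org.apache.spark.SparkContext._  // rdd.countAsync() implicit

    val rdd = sc.parallelize(1 to 1000000)

    sc.setJobGroup("group-A", "first batch")
    val running = rdd.countAsync()    // job now running under group-A

    sc.cancelJobGroup("group-A")      // only *posts* a cancellation event

    sc.setJobGroup("group-B", "next batch")
    rdd.map(_ + 1).count()            // this submission and the cancellation
                                      // race on the scheduler's event loop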
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>>> SparkContext.cancelJobGroup
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rustagi@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> How do you cancel the job? Which API do you use?
>>>>>>
>>>>>> Mayur Rustagi
>>>>>> Ph: +1 (760) 203 3257
>>>>>> http://www.sigmoidanalytics.com
>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>
>>>>>>> i also noticed that jobs (with a new JobGroupId) which i run after
>>>>>>> this, and which use the same RDDs, get very confused. i see lots of
>>>>>>> cancelled stages and retries that go on forever.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>>
>>>>>>>> i have a running job that i cancel while keeping the spark context
>>>>>>>> alive.
>>>>>>>>
>>>>>>>> at the time of cancellation the active stage is 14.
>>>>>>>>
>>>>>>>> i see in logs:
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel
>>>>>>>> job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling
>>>>>>>> stage 10
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling
>>>>>>>> stage 14
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
>>>>>>>> cancelled
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove
>>>>>>>> TaskSet 14.0 from pool x
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling
>>>>>>>> stage 13
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling
>>>>>>>> stage 12
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling
>>>>>>>> stage 11
>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling
>>>>>>>> stage 15
>>>>>>>>
>>>>>>>> so far it all looks good. then i get a lot of messages like this:
>>>>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring
>>>>>>>> update with state FINISHED from TID 883 because its task set is gone
>>>>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring
>>>>>>>> update with state KILLED from TID 888 because its task set is gone
>>>>>>>>
>>>>>>>> after this, stage 14 hangs around in active stages, without any sign
>>>>>>>> of progress or cancellation. it just sits there forever, stuck. looking
>>>>>>>> at the logs of the executors confirms this: the tasks seem to still be
>>>>>>>> running, but nothing is happening. for example (by the time i look at
>>>>>>>> this it's 4:58, so this task hasn't done anything in 15 mins):
>>>>>>>>
>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943
>>>>>>>> is 1007
>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to
>>>>>>>> driver
>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945
>>>>>>>> is 1007
>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to
>>>>>>>> driver
>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>>>
>>>>>>>> not sure what to make of this. any suggestions? best, koert
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
