spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: trying to understand job cancellation
Date Thu, 06 Mar 2014 13:44:01 GMT
it's a 0.9 snapshot from january, running in standalone mode.

have these fixes been merged into 0.9?


On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia <matei.zaharia@gmail.com> wrote:

> Which version of Spark is this in, Koert? There might have been some fixes
> for it more recently.
>
> Matei
>
> On Mar 5, 2014, at 5:26 PM, Koert Kuipers <koert@tresata.com> wrote:
>
> Sorry, I meant to say: it seems the issue is RDDs shared between a job that
> got cancelled and a later job.
>
> However, even disregarding that, I have the other issue that the active task
> of the cancelled job hangs around forever, not doing anything....
> On Mar 5, 2014 7:29 PM, "Koert Kuipers" <koert@tresata.com> wrote:
>
>> yes, jobs on RDDs that were not part of the cancelled job work fine.
>>
>> so it seems the issue is the cached RDDs that are shared between the
>> cancelled job and the jobs after that.
>>
>>
>> On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> well, the new jobs use existing RDDs that were also used in the job that
>>> got killed.
>>>
>>> let me confirm that new jobs that use completely different RDDs do not
>>> get killed.
>>>
>>>
>>>
>>> On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>
>>>> Quite unlikely, as job ids are handed out in an incremental fashion, so
>>>> your future job ids are not likely to be killed if your group id is not
>>>> repeated. I guess the issue is something else.
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>>> i did that. my next job gets a random new job group id (a uuid).
>>>>> however, that doesn't seem to stop the job from getting sucked into the
>>>>> cancellation.
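>>>>>
>>>>> what i do is roughly this (just a sketch; sc is the live SparkContext and
>>>>> rdd stands in for whatever RDD the next job runs on):
>>>>>
>>>>> import java.util.UUID
>>>>>
>>>>> // fresh job group id per submission, so a cancel aimed at an earlier
>>>>> // group never names this one
>>>>> val groupId = UUID.randomUUID().toString
>>>>> sc.setJobGroup(groupId, "next job")
>>>>> rdd.count()  // this action runs under groupId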
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>>>
>>>>>> You can randomize job groups as well, to secure yourself against
>>>>>> termination.
>>>>>>
>>>>>> Mayur Rustagi
>>>>>> Ph: +1 (760) 203 3257
>>>>>> http://www.sigmoidanalytics.com
>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>
>>>>>>> got it. seems like i'd better stay away from this feature for now...
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>>>>>
>>>>>>>> One issue is that job cancellation is posted on the event loop. So it's
>>>>>>>> possible that subsequent jobs submitted to the job queue may beat the job
>>>>>>>> cancellation event & hence the job cancellation event may end up closing
>>>>>>>> them too.
>>>>>>>> So there's definitely a race condition you are risking, even if not
>>>>>>>> running into it.
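>>>>>>>>
>>>>>>>> To make the ordering concrete, here is a sketch of the risky pattern
>>>>>>>> (the group id strings and the rdd are illustrative):
>>>>>>>>
>>>>>>>> // driver thread A: the cancel is only posted to the scheduler's event loop
>>>>>>>> sc.cancelJobGroup("old-group")
>>>>>>>>
>>>>>>>> // driver thread B: submits a new job right away
>>>>>>>> sc.setJobGroup("new-group", "follow-up job")
>>>>>>>> rdd.count()
>>>>>>>> // if this submission reaches the scheduler before the cancellation
>>>>>>>> // event is processed, the cancellation may end up closing it too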
>>>>>>>>
>>>>>>>> Mayur Rustagi
>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>>>
>>>>>>>>> SparkContext.cancelJobGroup
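>>>>>>>>>
>>>>>>>>> i.e. roughly this (a sketch; sc is the SparkContext and the group id
>>>>>>>>> string is just an example):
>>>>>>>>>
>>>>>>>>> sc.setJobGroup("my-group", "cancellable work")  // before submitting the job
>>>>>>>>> // ... later, from another thread:
>>>>>>>>> sc.cancelJobGroup("my-group")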
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> How do you cancel the job? Which API do you use?
>>>>>>>>>>
>>>>>>>>>> Mayur Rustagi
>>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> i also noticed that jobs (with a new JobGroupId) which i run
>>>>>>>>>>> after this, and which use the same RDDs, get very confused. i see
>>>>>>>>>>> lots of cancelled stages and retries that go on forever.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> i have a running job that i cancel while keeping the spark
>>>>>>>>>>>> context alive.
>>>>>>>>>>>>
>>>>>>>>>>>> at the time of cancellation the active stage is 14.
>>>>>>>>>>>>
>>>>>>>>>>>> i see in the logs:
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>>>>>>>>>
>>>>>>>>>>>> so far it all looks good. then i get a lot of messages like this:
>>>>>>>>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>>>>>>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>>>>>>>>>
>>>>>>>>>>>> after this, stage 14 hangs around in active stages, without any
>>>>>>>>>>>> sign of progress or cancellation. it just sits there forever, stuck.
>>>>>>>>>>>> looking at the logs of the executors confirms this. the tasks seem
>>>>>>>>>>>> to be still running, but nothing is happening. for example (by the
>>>>>>>>>>>> time i look at this it's 4:58, so this task hasn't done anything in
>>>>>>>>>>>> 15 mins):
>>>>>>>>>>>>
>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>>>>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>>>>>>>
>>>>>>>>>>>> not sure what to make of this. any suggestions?
>>>>>>>>>>>> best, koert
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
