flink-user mailing list archives

From Kostas Kloudas <k.klou...@data-artisans.com>
Subject Re: Job restart hook
Date Wed, 04 Apr 2018 08:45:50 GMT

Hi Navneeth,

I am sending the answer to the user mailing list so that we keep the discussion public.
There may also be other users interested in the question.

The answer is that you cannot restore from an externalized checkpoint
with a different parallelism. To change the parallelism, you have to take a savepoint
and restore the job from that. You can find more on this in [1].
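
As a rough illustration of the savepoint route, the sketch below picks a parallelism that fits the slots currently registered and builds the CLI calls for a savepoint-based rescale. It assumes the `flink` CLI is on the PATH, that the job is still running when the savepoint is triggered, and that the JobManager's REST endpoint `/overview` reports `slots-total`. The function names, job ID, paths, and jar name are all hypothetical, and the savepoint path printed by `flink savepoint` has to be captured by the caller.

```python
import json
import urllib.request


def fetch_total_slots(base_url):
    """Read the total slot count from the JobManager's /overview endpoint.

    Assumption: the response is JSON with a "slots-total" field.
    """
    with urllib.request.urlopen(base_url.rstrip("/") + "/overview") as resp:
        return json.load(resp)["slots-total"]


def choose_parallelism(total_slots, desired):
    """Use the desired parallelism if the slots allow it, otherwise what is there."""
    if total_slots < 1:
        raise ValueError("no slots registered, cannot run the job")
    return min(total_slots, desired)


def rescale_commands(job_id, savepoint_dir, savepoint_path, jar, parallelism):
    """Build (but do not run) the CLI calls for a savepoint-based rescale.

    savepoint_path is the path printed by the first command; it is passed in
    here because this sketch does not execute anything itself.
    """
    return [
        # 1. Trigger a savepoint for the running job.
        ["flink", "savepoint", job_id, savepoint_dir],
        # 2. Cancel the job once the savepoint has completed.
        ["flink", "cancel", job_id],
        # 3. Resubmit from the savepoint with the new parallelism.
        ["flink", "run", "-d", "-s", savepoint_path, "-p", str(parallelism), jar],
    ]


if __name__ == "__main__":
    # Hypothetical values: 16 slots left after losing a 4-slot TM.
    p = choose_parallelism(total_slots=16, desired=20)
    for cmd in rescale_commands("<jobId>", "hdfs:///savepoints",
                                "hdfs:///savepoints/<savepoint>", "job.jar", p):
        print(" ".join(cmd))
```

Each command list can be handed to `subprocess.run` one at a time. Note that the new parallelism cannot exceed the job's configured maximum parallelism.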

Thanks,
Kostas

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html

> On Apr 3, 2018, at 7:40 PM, Navneeth Krishnan <reachnavneeth2@gmail.com> wrote:
> 
> Thanks a lot Kostas. The issue we are facing is that it sometimes takes a very long time to
> bring up the TM, and we don't want to stall the entire job until the TM is back up. That's why
> we wanted to explore this option and see if it works. One small question on the same topic: can
> we restore from checkpoints with a different parallelism?
> 
> On Tue, Apr 3, 2018 at 2:48 AM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:
> Hi Navneeth,
> 
> If I understand correctly, you have a job with parallelism p=20, a TM goes down (e.g. one
> with 4 slots), and until that TM comes back up you want to run the job with p=16, and then
> re-run it with p=20 once the TM is back.
> 
> If this is the case, one important thing to keep in mind is that when a TM fails, the
> whole job restarts, and not only the tasks that were running on that TM.
> 
> Given this, and assuming that the lost TM will not take long to come back up, I am not sure
> you save anything by starting a job with parallelism 20, restarting it with parallelism 16
> (in your example) until the TM comes up, and then taking a savepoint, stopping the job, and
> restarting it with parallelism 20 again.
> 
> If you still want to do it, one way is to use the REST API to get the necessary
> information about your cluster and the state of your job, and write a script that takes
> the necessary actions, e.g. resubmitting the job with a different parallelism.
> 
> I hope this helps,
> Kostas
> 
> > On Mar 29, 2018, at 8:02 PM, Navneeth Krishnan <reachnavneeth2@gmail.com> wrote:
> >
> > Hi,
> >
> > Is there a way for a script to be called whenever a job gets restarted? My scenario is:
> > let's say there are 20 slots and the job runs on all 20 slots. After a while a task manager
> > goes down and now there are only 14 slots, and I need to readjust the parallelism of my job
> > to ensure the job keeps running until the lost TM comes up again. It would be great to know
> > how others are handling this situation.
> >
> > Thanks,
> > Navneeth
> 
> 

