flink-issues mailing list archives

From "Zhu Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-11813) Standby per job mode Dispatchers don't know job's JobSchedulingStatus
Date Tue, 16 Apr 2019 02:55:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818575#comment-16818575
] 

Zhu Zhu edited comment on FLINK-11813 at 4/16/19 2:54 AM:
----------------------------------------------------------

I think with SubmittedJobGraphStore being an underlying layer of RunningJobsRegistry, there
is no need to update the job status to RUNNING explicitly. We may wrap them in a *JobStore*
which not only provides submitted JobGraphs but also supports job running status queries.

There can be only 2 operations to the store:
 # _*addJob(submittedJobGraph)*_ to add a newly submitted JobGraph
 # _*markDone(jobID)*_ to mark the job status as DONE, which should also be stored in the
SubmittedJobGraphStore (we can even drop the graph file and keep the DONE status only). By the
way, the word *DONE* seems to mean that the job is FINISHED, not CANCELLED or FAILED; should we
use a more accurate word like *TERMINATED*?

And the job running status would transition as below:

*NONE* -- _addJob_ --> *RUNNING* -- _markDone_ --> *DONE*

Actually the underlying status is:
 # NONE: job graph does not exist
 # RUNNING: job graph exists and not DONE
 # DONE: job graph exists and DONE
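
The two operations and the derived status above can be sketched roughly as follows. This is only an illustrative in-memory model, not Flink code: the class name {{JobStoreSketch}}, the {{Status}} enum, and the use of plain strings for the job ID and the serialized job graph are all assumptions made for the example. A real store would persist entries through the HA SubmittedJobGraphStore.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed JobStore; none of these names are Flink's actual API.
public class JobStoreSketch {

    enum Status { NONE, RUNNING, DONE }

    // One entry per job. The graph payload is a stand-in for a SubmittedJobGraph.
    private static final class Entry {
        String jobGraph;
        boolean done;

        Entry(String jobGraph) {
            this.jobGraph = jobGraph;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Adds a newly submitted job graph. Returns false if the job is already stored. */
    public synchronized boolean addJob(String jobId, String jobGraph) {
        if (entries.containsKey(jobId)) {
            return false; // duplicate submission is ignored
        }
        entries.put(jobId, new Entry(jobGraph));
        return true;
    }

    /** Marks the job as DONE and drops the graph payload, keeping only the status. */
    public synchronized void markDone(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            throw new IllegalStateException("Unknown job: " + jobId);
        }
        entry.done = true;
        entry.jobGraph = null;
    }

    /** Derives the status: no entry means NONE, an entry that is not done means RUNNING. */
    public synchronized Status getStatus(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            return Status.NONE;
        }
        return entry.done ? Status.DONE : Status.RUNNING;
    }

    public static void main(String[] args) {
        JobStoreSketch store = new JobStoreSketch();
        System.out.println(store.getStatus("job-1")); // NONE
        store.addJob("job-1", "serialized-graph");
        System.out.println(store.getStatus("job-1")); // RUNNING
        store.markDone("job-1");
        System.out.println(store.getStatus("job-1")); // DONE
    }
}
```

Note that the sketch never writes RUNNING explicitly; that status follows from the presence of the entry, which is the point of layering the status on top of the graph store.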

For job mode, we may need to change the current SingleJobSubmittedJobGraphStore to an HA
SubmittedJobGraphStore, which would then make sharing the running status possible. The job mode
dispatcher (MiniDispatcher) should add the embedded jobGraph to the JobStore once it is granted
leadership (a duplicated jobGraph will be ignored).
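
As a rough illustration of the leadership hand-over, the sketch below shows a standby dispatcher consulting a shared store before deciding whether to execute its embedded job. Everything here is hypothetical: {{SharedJobStore}} is an in-memory stand-in for an HA store, and {{MiniDispatcherSketch}}, {{grantLeadership}} and {{onJobTerminated}} are names invented for the example rather than Flink's actual methods.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; class and method names are illustrative, not Flink's API.
public class LeadershipSketch {

    enum Status { RUNNING, DONE }

    // In-memory stand-in for an HA store that all dispatchers of the job can see.
    static final class SharedJobStore {
        private final ConcurrentMap<String, Status> jobs = new ConcurrentHashMap<>();

        /** Adds the job as RUNNING; returns false if it was already present. */
        boolean addJob(String jobId) {
            return jobs.putIfAbsent(jobId, Status.RUNNING) == null;
        }

        /** Marks an existing job as DONE; does nothing for unknown jobs. */
        void markDone(String jobId) {
            jobs.replace(jobId, Status.DONE);
        }

        boolean isDone(String jobId) {
            return jobs.get(jobId) == Status.DONE;
        }
    }

    static final class MiniDispatcherSketch {
        private final SharedJobStore store;
        private final String embeddedJobId;
        int executions = 0;

        MiniDispatcherSketch(SharedJobStore store, String embeddedJobId) {
            this.store = store;
            this.embeddedJobId = embeddedJobId;
        }

        /** Called when this dispatcher becomes leader; returns true if it runs the job. */
        boolean grantLeadership() {
            store.addJob(embeddedJobId); // a duplicate add is ignored
            if (store.isDone(embeddedJobId)) {
                return false; // job already terminated under a previous leader
            }
            executions++;
            return true;
        }

        /** Called when the job reaches a globally terminal state. */
        void onJobTerminated() {
            store.markDone(embeddedJobId);
        }
    }

    public static void main(String[] args) {
        SharedJobStore store = new SharedJobStore();
        MiniDispatcherSketch leader = new MiniDispatcherSketch(store, "job-1");
        MiniDispatcherSketch standby = new MiniDispatcherSketch(store, "job-1");

        System.out.println(leader.grantLeadership());  // true: the job is executed
        leader.onJobTerminated();
        System.out.println(standby.grantLeadership()); // false: the job is not re-executed
    }
}
```

If the first leader fails before calling {{onJobTerminated}}, the standby still sees RUNNING and executes the job again, which matches the intended recovery behaviour.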

was (Author: zhuzh):
I think with SubmittedJobGraphStore been a underlying layer RunningJobsRegistry unified, there
is no need to change the job status to RUNNING explicitly. Maybe we can wrap them in a *JobStore*
which not only provides submitted JobGraphs but also supports job running status queries.

There can be only 2 operations to the store:
 # _*addJob(submittedJobGraph)*_ to add a newly submitted JobGraph
 # _*markDone(jobID)*_ to mark the job status as DONE, which should also be stored in the
SubmittedJobGraphStore (we can even drop the graph file and keep the DONE status only). By the
way, the word *DONE* seems to mean that the job is FINISHED, not CANCELLED or FAILED; should we
use a more accurate word like *TERMINATED*?

And the job running status would transition as below:

*NONE* -- _addJob_ --> *RUNNING* -- _markDone_ --> *DONE*

Actually the underlying status is:
 # NONE: job graph does not exist
 # RUNNING: job graph exists and not DONE
 # DONE: job graph exists and DONE

For job mode, we may need to change the current SingleJobSubmittedJobGraphStore to an HA
SubmittedJobGraphStore, which would then make sharing the running status possible. The job mode
dispatcher (MiniDispatcher) should add the embedded jobGraph to the JobStore once it is granted
leadership (a duplicated jobGraph will be ignored).

> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode will restart
> a terminated job after they gained leadership. The problem is that we currently clear the
> {{RunningJobsRegistry}} once a job has reached a globally terminal state. After the leading
> {{Dispatcher}} terminates, a standby {{Dispatcher}} will gain leadership. Without having the
> information from the {{RunningJobsRegistry}} it cannot tell whether the job has been executed
> or whether the {{Dispatcher}} needs to re-execute the job. At the moment, the {{Dispatcher}}
> will assume that there was a fault and hence re-execute the job. This can lead to duplicate
> results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job has been
> successfully executed. One trivial solution could be to not clean up the {{RunningJobsRegistry}}
> but then we will clutter ZooKeeper.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
