helix-commits mailing list archives

From j...@apache.org
Subject [2/3] helix git commit: [HELIX-786] TASK: Fix stuck tasks after Participant connection loss
Date Fri, 02 Nov 2018 21:26:32 GMT
[HELIX-786] TASK: Fix stuck tasks after Participant connection loss

When a Helix Participant loses its ZK connection and starts a new ZK session, all task
partitions on that Participant are reset to the INIT state. This is undesirable because
these tasks should instead be considered dropped and be scheduled on another instance.
This is the Controller-side fix: when we detect tasks whose assigned Participants
are no longer live, we mark them as DROPPED in their parent JobContext so that AssignableInstance
does not count them as active when it is refreshed in the next pipeline. These
dropped tasks can then be reassigned to other instances.

Note that a Participant-side fix must follow so that reset() leaves task partitions
in the DROPPED state rather than INIT. This change does not by itself clear stuck INIT states
on the original Participant. However, by letting these tasks be assigned to other instances,
it lets jobs and workflows complete, at which point their CurrentStates are dropped.

1. Mark task partitions whose assigned Participants are no longer live as DROPPED in JobContext
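
For illustration only, the check described above can be sketched outside of Helix as
follows. This is a minimal sketch, not the Helix implementation: the State enum, the
Context class (a stand-in for JobContext), the isTerminal() helper and markDropped()
are all hypothetical names, and treating COMPLETED and DROPPED as the only terminal
states is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DroppedTaskSketch {
  // Hypothetical, reduced set of task partition states for this example.
  enum State { INIT, RUNNING, COMPLETED, DROPPED }

  // Hypothetical stand-in for JobContext: per-partition assignment and state.
  static class Context {
    final Map<Integer, String> assigned = new HashMap<>();
    final Map<Integer, State> states = new HashMap<>();
  }

  // Assumption: only COMPLETED and DROPPED count as terminal here.
  static boolean isTerminal(State state) {
    return state == State.COMPLETED || state == State.DROPPED;
  }

  // Marks every non-terminal partition whose assigned participant is not live
  // as DROPPED and returns how many partitions were changed.
  static int markDropped(Context ctx, Set<String> liveInstances) {
    int dropped = 0;
    for (Map.Entry<Integer, State> entry : ctx.states.entrySet()) {
      if (isTerminal(entry.getValue())) {
        continue;
      }
      String participant = ctx.assigned.get(entry.getKey());
      if (participant != null && !liveInstances.contains(participant)) {
        entry.setValue(State.DROPPED); // value update only, safe while iterating
        dropped++;
      }
    }
    return dropped;
  }

  public static void main(String[] args) {
    Context ctx = new Context();
    ctx.assigned.put(0, "host_a");
    ctx.states.put(0, State.RUNNING);
    ctx.assigned.put(1, "host_b");
    ctx.states.put(1, State.INIT);

    Set<String> live = new HashSet<>(Arrays.asList("host_a"));
    int changed = markDropped(ctx, live);
    System.out.println(changed + " partition(s) marked DROPPED: " + ctx.states);
  }
}
```

In this sketch only partition 1 is affected, because host_b is missing from the live
set; a scheduler that ignores DROPPED partitions would then be free to place that task
on another instance.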

Project: http://git-wip-us.apache.org/repos/asf/helix/repo
Commit: http://git-wip-us.apache.org/repos/asf/helix/commit/dc25bac1
Tree: http://git-wip-us.apache.org/repos/asf/helix/tree/dc25bac1
Diff: http://git-wip-us.apache.org/repos/asf/helix/diff/dc25bac1

Branch: refs/heads/master
Commit: dc25bac1ebdcddb08aaab2765abfe72008b06a31
Parents: bced099
Author: narendly <narendly@gmail.com>
Authored: Fri Nov 2 14:03:16 2018 -0700
Committer: narendly <narendly@gmail.com>
Committed: Fri Nov 2 14:03:16 2018 -0700

 .../main/java/org/apache/helix/task/AbstractTaskDispatcher.java    | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java b/helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
index cbf9fb8..cb721e5 100644
--- a/helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
+++ b/helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
@@ -693,6 +693,8 @@ public abstract class AbstractTaskDispatcher {
       if (isTaskNotInTerminalState(state)) {
         String assignedParticipant = jobContext.getAssignedParticipant(partitionNumber);
         if (assignedParticipant != null && !liveInstances.contains(assignedParticipant))
+          // The assigned instance is no longer live, so mark it as DROPPED in the context
+          jobContext.setPartitionState(partitionNumber, TaskPartitionState.DROPPED);
