hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "GroomServerFaultTolerance" by ChiaHungLin
Date Fri, 08 Apr 2011 11:56:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "GroomServerFaultTolerance" page has been changed by ChiaHungLin.
http://wiki.apache.org/hama/GroomServerFaultTolerance?action=diff&rev1=4&rev2=5

--------------------------------------------------

  
   1. While executing, a task periodically pings its parent GroomServer. 
   1. If the GroomServer does not receive a ping from the child within the timeout, it checks whether
the child JVM is running, for instance by executing jps to identify the child's status. 
-  1. GroomServer reports failure back to NodeMonitor. 
-  1. NodeMonitor notifies TaskScheduler of the task failure. 
+  1. GroomServer notifies TaskScheduler of the task failure.
   1. TaskScheduler updates JobInProgress.
   1. TaskScheduler reschedules the task to another GroomServer after searching for an appropriate one.
   1. If the number of times the task has been rescheduled reaches the limit, the whole job fails.
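
The task failure steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hama code: the class names (TaskFailureSketch, PingTracker, RescheduleCounter), the millisecond timestamps and the reading of "limit" as the maximum number of reschedules per task are all hypothetical. The jps liveness check is left out; a task flagged as suspect is where that check would be triggered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Hama code base.
final class TaskFailureSketch {

    /** GroomServer side: remembers when each child task last pinged. */
    static final class PingTracker {
        private final long timeoutMillis;
        private final Map<String, Long> lastPing = new HashMap<>();

        PingTracker(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        void ping(String taskId, long nowMillis) {
            lastPing.put(taskId, nowMillis);
        }

        /** True if the task never pinged or its last ping is older than the timeout. */
        boolean isSuspect(String taskId, long nowMillis) {
            Long last = lastPing.get(taskId);
            return last == null || nowMillis - last > timeoutMillis;
        }
    }

    /** TaskScheduler side: counts how often each task has been rescheduled. */
    static final class RescheduleCounter {
        private final int limit; // assumed meaning: maximum reschedules per task
        private final Map<String, Integer> reschedules = new HashMap<>();

        RescheduleCounter(int limit) {
            this.limit = limit;
        }

        /** Returns true if the task may be rescheduled, false if the job should fail. */
        boolean tryReschedule(String taskId) {
            int used = reschedules.getOrDefault(taskId, 0);
            if (used >= limit) {
                return false;
            }
            reschedules.put(taskId, used + 1);
            return true;
        }
    }

    public static void main(String[] args) {
        PingTracker pings = new PingTracker(1_000);
        pings.ping("task_0", 0);
        System.out.println("suspect: " + pings.isSuspect("task_0", 2_500));

        RescheduleCounter counter = new RescheduleCounter(2);
        for (int failure = 1; failure <= 3; failure++) {
            System.out.println("failure " + failure + " rescheduled: "
                + counter.tryReschedule("task_0"));
        }
    }
}
```

Keeping the ping bookkeeping separate from the reschedule counter mirrors the split between GroomServer and TaskScheduler in the list; whether the real components divide the work this way is an assumption.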
  
  '''GroomServer Failure'''
  
-  1. NodeManager embedded in the GroomServer periodically sends a heartbeat to NodeMonitor
in BSPMaster. 
+  1. NodeManager embedded in the GroomServer periodically sends a heartbeat to NodeMonitor
in BSPMaster. [[https://issues.apache.org/jira/browse/HAMA-370|Hama-370]]
   1. One of the GroomServers fails, which BSPMaster observes as the loss of heartbeats from that GroomServer.

-  1. NodeMonitor collects metrics information, including CPU, memory, tasks, etc., from healthy
NodeManagers. 
+  1. NodeMonitor [[https://issues.apache.org/jira/browse/HAMA-363|Hama-363]] collects metrics
information, including CPU, memory, tasks, etc., from healthy NodeManagers. 
   1. Dispatch task(s) to GroomServer(s). 
      i. NodeMonitor notifies TaskScheduler of the GroomServer failure and moves the failed
GroomServer to the blacklist (it is moved back when the failed GroomServer restarts).
      i. TaskScheduler searches the node list for GroomServer(s) whose workload is not
heavy (which GroomServer is chosen depends on the policy).
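
The GroomServer failure steps above could look roughly like the sketch below. It is a simplified illustration and not the Hama implementation: GroomFailureSketch and its methods are hypothetical names, the running task count stands in for the CPU, memory and task metrics, a new heartbeat is taken as the sign that a failed GroomServer has restarted, and "least loaded" is just one possible scheduling policy.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Hama code base.
final class GroomFailureSketch {
    private final long heartbeatTimeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Integer> runningTasks = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    GroomFailureSketch(long heartbeatTimeoutMillis) {
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
    }

    /** Records a heartbeat and its workload metric; a groom that reports again leaves the blacklist. */
    void heartbeat(String groom, int tasks, long nowMillis) {
        lastHeartbeat.put(groom, nowMillis);
        runningTasks.put(groom, tasks);
        blacklist.remove(groom);
    }

    /** Blacklists grooms whose heartbeat is overdue and returns the newly failed ones. */
    Set<String> detectFailures(long nowMillis) {
        Set<String> newlyFailed = new HashSet<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            boolean overdue = nowMillis - e.getValue() > heartbeatTimeoutMillis;
            if (overdue && blacklist.add(e.getKey())) {
                newlyFailed.add(e.getKey());
            }
        }
        return newlyFailed;
    }

    /** Assumed policy: the healthy groom with the fewest running tasks. */
    Optional<String> pickTarget() {
        return runningTasks.entrySet().stream()
            .filter(e -> !blacklist.contains(e.getKey()))
            .min(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        GroomFailureSketch monitor = new GroomFailureSketch(1_000);
        monitor.heartbeat("groom_a", 5, 0);
        monitor.heartbeat("groom_b", 1, 0);
        monitor.heartbeat("groom_a", 5, 900); // groom_b stays silent
        System.out.println("failed: " + monitor.detectFailures(1_500));
        System.out.println("target: " + monitor.pickTarget());
    }
}
```

The tasks of a newly failed GroomServer would then be handed to the target returned by pickTarget(); how the TaskScheduler learns which tasks those are is not covered here.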
