hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "GroomServerFaultTolerance" by ChiaHungLin
Date Tue, 05 Apr 2011 11:08:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "GroomServerFaultTolerance" page has been changed by ChiaHungLin.
http://wiki.apache.org/hama/GroomServerFaultTolerance?action=diff&rev1=2&rev2=3

--------------------------------------------------

  
  === Architecture ===
  
+ * The NodeManager embedded in each GroomServer periodically sends a heartbeat to the NodeMonitor in BSPMaster. // Can't attach diagram
+ 
+ * One of the GroomServers fails, which BSPMaster observes as a lost heartbeat from that particular GroomServer. // Can't attach diagram
+ 
+ * NodeMonitor collects metrics information, including CPU, memory, tasks, etc., from healthy NodeManagers. // Can't attach diagram
+ 
+ * Dispatch task(s) to GroomServer(s). // Can't attach diagram 
+ 
+ 1. NodeMonitor notifies TaskScheduler of the GroomServer failure and moves the failed GroomServer to the blacklist (it is moved back when the failed GroomServer restarts).
+ 
+ 2. TaskScheduler searches the node list for GroomServer(s) whose workload is not heavy (which GroomServer is chosen depends on the policy).
+ 
+ 3. Update the task(s) in JobInProgress by assigning the failed tasks to the GroomServer(s) found in the previous step.
+ 
+ 4. Dispatch task(s) to the designated GroomServer(s).
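
Steps 1 to 4 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hama's actual scheduler code: the class `FailoverSketch`, its methods, and the "fewest assigned tasks" policy are all hypothetical, and the wiki page does not say which policy is used or how workload is measured.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of failover steps 1-4; names are not taken from Hama. */
class FailoverSketch {
    // GroomServer name -> tasks currently assigned to it (stand-in for "workload").
    final Map<String, List<String>> tasksByGroom = new HashMap<>();
    // GroomServers considered failed until they register again.
    final Set<String> blacklist = new HashSet<>();

    /** A GroomServer that (re)starts is added to the node list and leaves the blacklist. */
    void register(String groom) {
        tasksByGroom.putIfAbsent(groom, new ArrayList<>());
        blacklist.remove(groom);
    }

    void assign(String groom, String task) {
        tasksByGroom.get(groom).add(task);
    }

    /** Step 2. Assumed policy: the healthy GroomServer with the fewest tasks, ties broken by name. */
    Optional<String> leastLoaded() {
        return tasksByGroom.keySet().stream()
                .filter(g -> !blacklist.contains(g))
                .min(Comparator.comparingInt((String g) -> tasksByGroom.get(g).size())
                        .thenComparing(Comparator.naturalOrder()));
    }

    /** Steps 1, 3 and 4: blacklist the failed GroomServer and reassign its tasks. Returns task -> new GroomServer. */
    Map<String, String> onGroomFailure(String failed) {
        blacklist.add(failed);                                   // step 1
        List<String> orphaned = tasksByGroom.getOrDefault(failed, new ArrayList<>());
        tasksByGroom.put(failed, new ArrayList<>());
        Map<String, String> moved = new HashMap<>();
        for (String task : orphaned) {
            String target = leastLoaded()                        // step 2
                    .orElseThrow(() -> new IllegalStateException("no healthy GroomServer"));
            assign(target, task);                                // steps 3 and 4
            moved.put(task, target);
        }
        return moved;
    }
}
```

Re-evaluating `leastLoaded()` for every orphaned task spreads the failed GroomServer's tasks across the remaining nodes instead of piling them onto one; a different policy could replace that single method.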
+ 
+ 
+   
+ 
+ 
+ 
  
  
  === Glossary ===
  
- NodeManager
+ NodeMonitor: a component that monitors the health of GroomServers. 
  
- Failure Detector
+ NodeManager: a component that collects metrics information and, when NodeMonitor requests it, reports the status of the GroomServer it runs on.
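
The relationship between the two glossary entries can be illustrated with a small heartbeat tracker. This is a minimal sketch under assumptions: `HeartbeatMonitorSketch`, its methods, and the timeout-based failure rule are hypothetical, since the page does not specify how a lost heartbeat is detected or what interval is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical NodeMonitor-side bookkeeping; not Hama's actual API. */
class HeartbeatMonitorSketch {
    private final long timeoutMillis;                       // assumed failure threshold
    private final Map<String, Long> lastSeen = new TreeMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record a heartbeat received from the NodeManager on the given GroomServer. */
    void heartbeat(String groom, long nowMillis) {
        lastSeen.put(groom, nowMillis);
    }

    /** GroomServers whose last heartbeat is older than the timeout, in name order. */
    List<String> failedGrooms(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the class, keeps the sketch deterministic and easy to exercise.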
  
- Supervisor behaviour
  
  === References ===
  [1]. Hadoop. http://hadoop.apache.org/
