hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "GroomServerFaultTolerance" by ChiaHungLin
Date Fri, 08 Apr 2011 11:52:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "GroomServerFaultTolerance" page has been changed by ChiaHungLin.
http://wiki.apache.org/hama/GroomServerFaultTolerance?action=diff&rev1=3&rev2=4

--------------------------------------------------

  
  === Introduction ===
  
- Distributed computing systems such as Hadoop[1] and Dryad[2] provide fault tolerance
to help the system survive process crashes. This is particularly useful when a computation
takes a long time to finish. Hama, based on the BSP[3] model, is a framework
for massive scientific computations, and it also requires this feature so that developers and
users of the framework can benefit from it. This page describes the intended direction
for how Hama GroomServer fault tolerance would work.
+ Distributed computing systems such as MapReduce[1] and Dryad[2] provide fault tolerance
to help the system survive process crashes. This is particularly useful when a computation
takes a long time to finish. Hama, based on the BSP[3] model, is a framework
for massive scientific computations, and it also requires this feature so that developers and
users of the framework can benefit from it. This page describes the intended direction
for how Hama GroomServer fault tolerance would work.
  
  === Literature Review ===
  
  
  
  === Architecture ===
+ '''Task Failure'''
  
- * NodeMaanger embedded in the GroomServer periodically sends heartbeat to NodeMonitor in
BSPMaster. // Can't attach diagram 
+ Each task is spawned by the GroomServer as a child process, so that a failure of the task
does not bring down the GroomServer. The following steps are performed when a task
fails.
  
- * One of GroomServers fails, indicating BSPMaster loses heartbeat from a particular GroomServer.
// Can't attach diagram 
+  1. While executing, the task periodically pings its parent GroomServer.
+  1. If the GroomServer does not receive a ping from the child within the timeout, it checks
whether the child JVM is still running, for instance by executing jps to identify the child's status.
+  1. GroomServer reports the failure back to NodeMonitor.
+  1. NodeMonitor notifies TaskScheduler of the task failure.
+  1. TaskScheduler updates JobInProgress.
+  1. TaskScheduler searches for an appropriate GroomServer and reschedules the task to it.
+  1. If the number of times the task has been rescheduled reaches the limit, the whole job fails.
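
The task-failure steps above can be sketched roughly as follows. This is only an illustration in Python, not Hama code: the names Task, ping_timed_out and reschedule, the timeout and max_attempts parameters, and the "first other GroomServer" placement policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    groom: str          # GroomServer currently running the task (hypothetical field)
    last_ping: float    # time of the most recent ping, in seconds
    attempts: int = 1   # how many times the task has been scheduled so far


def ping_timed_out(task, now, timeout):
    """Steps 1-2: the GroomServer notices that its child has stopped pinging."""
    return now - task.last_ping > timeout


def reschedule(task, grooms, max_attempts):
    """Steps 4-7: move the task to another GroomServer, or give up at the limit.

    Returns the name of the new GroomServer, or None when the task cannot be
    rescheduled (limit reached or no other GroomServer), i.e. the job fails.
    """
    if task.attempts >= max_attempts:
        return None
    candidates = [g for g in grooms if g != task.groom]
    if not candidates:
        return None
    task.groom = candidates[0]  # placeholder policy: first other GroomServer
    task.attempts += 1
    return task.groom
```

A caller would invoke ping_timed_out on each monitoring tick and, once the child is confirmed dead, pass the task to reschedule; a None result corresponds to the "whole job fails" step.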
  
- * NodeMonitor collects metrics information, including CPU, memory, tasks, etc., from healthy
NodeManagers. // Can't attach diagram 
+ '''GroomServer Failure'''
  
- * Dispatch task(s) to GroomServer(s). // Can't attach diagram 
- 
+  1. The NodeManager embedded in each GroomServer periodically sends a heartbeat to the NodeMonitor
in BSPMaster. 
+  1. One of the GroomServers fails, which BSPMaster observes as the loss of heartbeats from that GroomServer.
+  1. NodeMonitor collects metrics information, including CPU, memory, tasks, etc., from healthy
NodeManagers. 
+  1. Dispatch task(s) to GroomServer(s). 
- 1. NodeMonitor notifies TaskScheduler the failure of GroomServers; and move failure GroomServer
to black list (will move back when the failed GroomServer restarts).
+     i. NodeMonitor notifies TaskScheduler of the GroomServer failure and moves the failed
GroomServer to the black list (it is moved back when the failed GroomServer restarts).
- 
- 2. TaskScheduler searches node list looking for GroomServer(s) whose workload is not heavy
(which GroomServer to go is corresponded to policy). 
+     i. TaskScheduler searches the node list for GroomServer(s) whose workload is not
heavy (which GroomServer is chosen depends on the policy).
- 
- 3. Update task(s) JobInProgress by assigning failed tasks to the GroomServer found in previous
step. 
+     i. Update the JobInProgress of the task(s) by assigning the failed tasks to the GroomServer
found in the previous step.
- 
- 4. Dispatch task(s) to designed GroomServer(s).
+     i. Dispatch task(s) to the designated GroomServer(s).
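
The GroomServer-failure flow above might look something like the sketch below. It is written in Python for brevity and is not the Hama implementation: the NodeMonitor class here, its heartbeat, detect_failures and least_loaded methods, the reassign helper, and the "fewest running tasks" policy are hypothetical stand-ins for the components named on this page.

```python
class NodeMonitor:
    """Tracks heartbeats and load per GroomServer (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}  # groom name -> time of last heartbeat
        self.load = {}            # groom name -> number of running tasks
        self.blacklist = set()

    def heartbeat(self, groom, now, running_tasks):
        self.last_heartbeat[groom] = now
        self.load[groom] = running_tasks
        # A restarted GroomServer that reports again leaves the black list.
        self.blacklist.discard(groom)

    def detect_failures(self, now):
        """Black-list and return GroomServers whose heartbeat is overdue."""
        failed = [g for g, t in self.last_heartbeat.items()
                  if g not in self.blacklist and now - t > self.timeout]
        self.blacklist.update(failed)
        return failed

    def least_loaded(self):
        """Assumed policy: the healthy GroomServer with the fewest tasks."""
        healthy = [g for g in self.load if g not in self.blacklist]
        return min(healthy, key=lambda g: self.load[g]) if healthy else None


def reassign(monitor, assignments, failed_groom):
    """Move every task of failed_groom to a lightly loaded GroomServer.

    assignments maps task id -> groom name and is updated in place.
    Returns a dict of the tasks that were moved and their new GroomServer.
    """
    moved = {}
    for task_id, groom in assignments.items():
        if groom != failed_groom:
            continue
        target = monitor.least_loaded()
        if target is None:
            break  # no healthy GroomServer left to take the task
        assignments[task_id] = target
        monitor.load[target] += 1
        moved[task_id] = target
    return moved
```

Incrementing the load after each reassignment spreads the failed tasks across several healthy GroomServers instead of piling them onto one; a different policy could be substituted in least_loaded without touching the rest.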
- 
- 
-   
- 
- 
- 
- 
  
  === Glossary ===
  
@@ -43, +42 @@

  
  
  === References ===
- [1]. Hadoop. http://hadoop.apache.org/
+ [1]. MapReduce: simplified data processing on large clusters. http://portal.acm.org/citation.cfm?id=1327492
  
  [2]. Dryad: distributed data-parallel programs from sequential building blocks. http://portal.acm.org/citation.cfm?id=1273005
  
