hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "GroomServerFaultTolerance" by ChiaHungLin
Date Sun, 24 Apr 2011 12:16:25 GMT
Dear Wiki user,

The "GroomServerFaultTolerance" page has been changed by ChiaHungLin.


  Distributed computing systems such as MapReduce[1] and Dryad[2] provide fault tolerance
so that the system can survive process crashes. This is particularly useful when a computation
takes a long time to finish. Hama, based on the BSP[3] model, is a framework
for massive scientific computations and also requires this feature so that developers and
users of the framework can benefit from it. This page describes the direction in which
Hama GroomServer fault tolerance is expected to work.
  === Literature Review ===
+ In general, a system designed to deal with failures is largely based on concepts including
units of mitigation, redundancy, and fault observers[4].
+ The architecture defines the basic unit which performs functions of a system according to
+ Providing redundant units. 
+ Fault observers are designed to detect faults or errors at an early stage so that other
strategies, such as error recovery, can be employed to correct the problem.
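
The fault observer idea can be illustrated with a minimal heartbeat-based sketch. This is only an assumption about one possible shape of such an observer, not Hama's actual design or API: the class name HeartbeatObserver, its methods, and the timeout policy are all hypothetical. Each unit reports in periodically, and any unit that stays silent for longer than the timeout is marked as suspected so that a recovery strategy can act on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical heartbeat-based fault observer; illustrative only, not part of Hama. */
final class HeartbeatObserver {
    // Maximum silence (in milliseconds) tolerated before a unit is suspected.
    private final long timeoutMillis;
    // Last time each unit reported in, keyed by a hypothetical unit name.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatObserver(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that the named unit reported in at the given time. */
    void heartbeat(String unit, long nowMillis) {
        lastSeen.put(unit, nowMillis);
    }

    /** Returns the units whose last heartbeat is older than the timeout. */
    List<String> suspected(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Passing the current time in explicitly, rather than reading the system clock inside the class, keeps the sketch deterministic and easy to test. A real observer would more likely run on a timer and hand suspected units to whatever recovery strategy is in place.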
@@ -47, +56 @@

  [3]. Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software. http://portal.acm.org/citation.cfm?id=798134
+ [4]. Patterns for Fault Tolerant Software. http://portal.acm.org/citation.cfm?id=1557393
+ [5]. Supervisor Behaviour. http://www.erlang.org/doc/design_principles/sup_princ.html
+ [6]. Extensible Resource Management For Cluster Computing. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=603418
