hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "BSPMaster" by ChiaHungLin
Date Sat, 24 Feb 2018 17:28:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "BSPMaster" page has been changed by ChiaHungLin:

   * Stopped
  {{attachment:MasterStateTransition.png|BSPMaster State}}
- == Scenario ==
-  * Restart
-   * When a '''reported''' task fails on a groom server, restart that job by re-running '''all'''
tasks from the latest universally available checkpoint. Re-running only the failed task is not
enough, because the universally available checkpoint may be more than one superstep behind the
current superstep. This can lead to a deadlock between the surviving tasks and the restarted
one during the sync phase. For example, suppose the latest universally available checkpoint is
from the 6th superstep, and the computation is currently moving from the 7th to the 8th
superstep. If one of the tasks fails, the system migrates it to another machine and resumes it
from the 6th superstep checkpoint, while the other tasks keep running until they hit the
barrier sync at the 8th superstep. A deadlock then arises when the resumed task hits the
barrier sync at the 7th superstep, because no other task is at the 7th superstep. This is one
proposed solution to the task failure issue. More complicated logic could be applied, but for
now the simpler approach may be implemented first.
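
The restart rule above can be illustrated with a small sketch. It is not taken from the Hama code base: the function names, the task ids, and the per-task checkpoint sets are all hypothetical, and the barrier is modelled very simply as "every task must wait at the same superstep". Under those assumptions it shows why resuming only the failed task stalls, while restarting all tasks from the latest universally available checkpoint does not.

```python
from typing import Dict, Set


def latest_universal_checkpoint(checkpoints: Dict[str, Set[int]]) -> int:
    """Return the highest superstep for which every task has a checkpoint."""
    common = set.intersection(*checkpoints.values())
    if not common:
        raise ValueError("no checkpoint is available for all tasks")
    return max(common)


def restart_plan(checkpoints: Dict[str, Set[int]]) -> Dict[str, int]:
    """Simple policy: every task resumes from the same superstep."""
    superstep = latest_universal_checkpoint(checkpoints)
    return {task: superstep for task in checkpoints}


def barrier_can_release(positions: Dict[str, int]) -> bool:
    """Assumed barrier model: release only if all tasks wait at one superstep."""
    return len(set(positions.values())) == 1


# Hypothetical checkpoint state: task-1 has not yet written superstep 7.
checkpoints = {
    "task-0": {4, 6, 7},
    "task-1": {4, 6},
    "task-2": {4, 6, 7},
}

# Re-running only the failed task-1 from superstep 6: it waits at the
# superstep 7 barrier while the surviving tasks wait at superstep 8.
partial_restart = {"task-0": 8, "task-1": 7, "task-2": 8}

# Restarting all tasks from the latest universally available checkpoint.
full_restart = restart_plan(checkpoints)
```

With this data, `barrier_can_release(partial_restart)` is false, which corresponds to the deadlock in the example, whereas `full_restart` places every task at superstep 6, so all of them reach the next barrier together.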
  == Source ==
