giraph-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
Date Mon, 27 Mar 2017 20:25:41 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943980#comment-15943980 ]

ASF GitHub Bot commented on GIRAPH-1139:
----------------------------------------

GitHub user neggert opened a pull request:

    https://github.com/apache/giraph/pull/30

    [GIRAPH-1139] Fix resuming from checkpoint

    A couple of fixes that get resuming from checkpoint working.
    
    * Set checkpointStatus to NONE in master when restarting from checkpoint.
    
    Workers already do this, so the job hangs when restarting from checkpoint
    while the master waits for workers to create checkpoints they're never
    going to create.
    
    * Set unique task id for each worker attempt
    
    Previously, a worker would reuse the task id from the prior attempt. This
    gets propagated to the Netty client id, which makes the master think it has
    already processed any requests that come from that client, causing it to
    discard them. This obviously causes problems.
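
A minimal sketch of the failure mode described in the second bullet, not the actual Giraph code. It assumes the server de-duplicates on a (client id, request id) pair and that a restarted worker starts its request counter again at 0. The class `RequestDedupSketch`, its methods, and the `partition * maxAttempts + attempt` scheme are hypothetical names chosen for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a server that drops requests it believes it already handled. */
public class RequestDedupSketch {
  /** Keys of the form "clientId:requestId" for every request processed so far. */
  private final Set<String> seen = new HashSet<>();

  /** Returns true if the request is processed, false if it is dropped as a duplicate. */
  public boolean accept(int clientId, long requestId) {
    return seen.add(clientId + ":" + requestId);
  }

  /** Behavior described above: the id depends only on the partition, not the attempt. */
  public static int reusedTaskId(int partition, int attempt) {
    return partition;
  }

  /**
   * One illustrative way to make the id unique per attempt (an assumption,
   * not necessarily what the pull request does). Requires attempt < maxAttempts.
   */
  public static int uniqueTaskId(int partition, int attempt, int maxAttempts) {
    return partition * maxAttempts + attempt;
  }

  public static void main(String[] args) {
    RequestDedupSketch server = new RequestDedupSketch();
    // Attempt 0 of the worker for partition 3 sends request 0: processed.
    System.out.println(server.accept(reusedTaskId(3, 0), 0L));
    // Attempt 1 reuses the id and restarts its request counter: dropped.
    System.out.println(server.accept(reusedTaskId(3, 1), 0L));

    RequestDedupSketch fixed = new RequestDedupSketch();
    // With attempt-qualified ids, both attempts are processed.
    System.out.println(fixed.accept(uniqueTaskId(3, 0, 4), 0L));
    System.out.println(fixed.accept(uniqueTaskId(3, 1, 4), 0L));
  }
}
```

In this model the second attempt's request is silently discarded until the attempt number becomes part of the id.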
    
    This PR also includes a fix for GIRAPH-1136: we now checkpoint on superstep 0 if
    checkpointing is enabled. Let me know if you'd rather I sent a separate PR for this.
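
A rough sketch of what a checkpoint schedule that includes superstep 0 could look like. The description above only states that superstep 0 is checkpointed when checkpointing is enabled. The modulo rule, the class name `CheckpointScheduleSketch`, and the convention that a frequency of 0 disables checkpointing are assumptions made for illustration.

```java
/** Hypothetical checkpoint schedule; not the actual Giraph implementation. */
public class CheckpointScheduleSketch {
  /**
   * Returns true if a checkpoint should be written at the given superstep.
   * Assumes frequency <= 0 means checkpointing is disabled and that negative
   * supersteps (for example an input phase) are never checkpointed.
   */
  public static boolean shouldCheckpoint(long superstep, int frequency) {
    if (frequency <= 0) {
      return false;
    }
    // Superstep 0 satisfies 0 % frequency == 0, so it is always included.
    return superstep >= 0 && superstep % frequency == 0;
  }

  public static void main(String[] args) {
    for (long s = 0; s < 5; s++) {
      System.out.println("superstep " + s + ": " + shouldCheckpoint(s, 2));
    }
  }
}
```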
    
    Testing:
    Ran a custom Label Propagation implementation with checkpointing on a ~5b node graph.
    Manually killed workers (by logging in to a worker node and running `kill -9 <pid>`) and
    confirmed that Giraph successfully resumed from the most recent checkpoint.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/neggert/giraph trunk_resume_fixes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/giraph/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #30
    
----
commit 6462d48e46d84cd6aa5ecd5817b0d057ce3a6c1f
Author: NicEggert <nicholas.eggert@target.com>
Date:   2017-03-23T20:23:03Z

    Set checkpointStatus to NONE in master when restarting from checkpoint.
    
    Workers already do this, so the job hangs when restarting from checkpoint
    while the master waits for workers to create checkpoints they're never
    going to create.

commit 3ed8c18a3bc97c910e364bf7d48d50be25df704c
Author: NicEggert <nicholas.eggert@target.com>
Date:   2017-03-23T20:26:02Z

    Checkpoint on superstep 0 if checkpointing is enabled

commit 74bba4573dbb77242d81352f84969b114db1cb71
Author: NicEggert <nicholas.eggert@target.com>
Date:   2017-03-23T20:26:47Z

    Set unique task id for each worker attempt
    
    Previously, a worker would reuse the task id from the prior attempt. This
    gets propagated to the Netty client id, which makes the master think it has
    already processed any requests that come from that client, causing it to
    discard them. This obviously causes problems.

----


> Resuming from checkpoint doesn't work
> -------------------------------------
>
>                 Key: GIRAPH-1139
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1139
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 1.2.0
>            Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from checkpoints (using
> mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint again,
> while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, which stays
> the same across multiple attempts. This gets transferred to the Netty clientId, and the server
> starts ignoring messages from restarted workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.
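
The first issue in the report above can be illustrated with a toy model, given here as a sketch under stated assumptions rather than Giraph's real coordination code. It assumes the master waits for one "checkpoint done" signal per worker whenever its own flag says a checkpoint is due. The two-state `Status` enum and all method names are simplified, hypothetical stand-ins.

```java
/** Toy model of the master/worker checkpoint flag mismatch after a restart. */
public class CheckpointStatusSketch {
  /** Simplified stand-in for a checkpoint status flag (two states only). */
  public enum Status { NONE, CHECKPOINT }

  /** Signals the master waits for, given its own view of the status. */
  public static int expectedSignals(Status masterStatus, int workers) {
    return masterStatus == Status.CHECKPOINT ? workers : 0;
  }

  /** Signals the workers actually produce, given their view of the status. */
  public static int producedSignals(Status workerStatus, int workers) {
    return workerStatus == Status.CHECKPOINT ? workers : 0;
  }

  /** The master can move on only if every signal it expects is produced. */
  public static boolean masterCanProceed(Status master, Status worker, int workers) {
    return producedSignals(worker, workers) >= expectedSignals(master, workers);
  }

  public static void main(String[] args) {
    // After a restart: workers cleared the flag, master did not. The job hangs.
    System.out.println(masterCanProceed(Status.CHECKPOINT, Status.NONE, 4));
    // With the master also reset to NONE, both sides agree and the job proceeds.
    System.out.println(masterCanProceed(Status.NONE, Status.NONE, 4));
  }
}
```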



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
