giraph-dev mailing list archives

From "Nic Eggert (JIRA)"
Subject [jira] [Created] (GIRAPH-1139) Resuming from checkpoint doesn't work
Date Mon, 27 Mar 2017 20:20:41 GMT
Nic Eggert created GIRAPH-1139:

             Summary: Resuming from checkpoint doesn't work
                 Key: GIRAPH-1139
             Project: Giraph
          Issue Type: Bug
          Components: bsp
    Affects Versions: 1.2.0
            Reporter: Nic Eggert

I ran into a couple of issues when trying to get Giraph to resume from checkpoints (using
mapreduce.max.attempts rather than GiraphJobRetryChecker).

* If we just wrote a checkpoint, the master expects the workers to checkpoint again, while
the workers (correctly) clear the checkpointing flag.
* When workers restart, they take their task id from the partition number, which stays the
same across multiple attempts. This gets transferred to the Netty clientId, and the server
starts ignoring messages from restarted workers because it thinks it processed them already.
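
To illustrate the second issue, here is a minimal sketch in Java. It is not Giraph's actual code: the class `ClientIdSketch`, the `DedupServer` filter, and both id helpers are hypothetical names. It assumes the server stays up while a worker restarts, that the server drops any (clientId, requestId) pair it has already seen, and that a restarted worker numbers its requests from 0 again. Folding the attempt number into the id is one possible way around the collision, not necessarily the fix in the PR.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; these are not Giraph classes. */
public class ClientIdSketch {

    /** Server-side filter that drops any (clientId, requestId) pair seen before. */
    static final class DedupServer {
        private final Set<Long> seen = new HashSet<>();

        /** Returns true if the request is processed, false if it is dropped as a duplicate. */
        boolean accept(int clientId, int requestId) {
            // Pack both ids into one 64-bit key.
            long key = ((long) clientId << 32) | (requestId & 0xFFFFFFFFL);
            return seen.add(key);
        }
    }

    /** Client id taken from the partition number alone: identical on every attempt. */
    static int idFromPartition(int partition, int attempt) {
        return partition; // the attempt number is ignored, which causes the collision
    }

    /** Assumed alternative: make the id unique per attempt (maxPartitions is a made-up bound). */
    static int idFromPartitionAndAttempt(int partition, int attempt, int maxPartitions) {
        return attempt * maxPartitions + partition;
    }

    public static void main(String[] args) {
        DedupServer server = new DedupServer();
        int partition = 3;

        // Attempt 0 sends request 0: processed.
        System.out.println(server.accept(idFromPartition(partition, 0), 0));

        // Attempt 1 restarts with the same client id and request 0: dropped as a duplicate.
        System.out.println(server.accept(idFromPartition(partition, 1), 0));

        // With an attempt-aware id, the restarted worker's request 0 is processed.
        System.out.println(server.accept(idFromPartitionAndAttempt(partition, 1, 1000), 0));
    }
}
```

Under these assumptions, the first request is processed, the restarted worker's request is dropped because it reuses the same id pair, and the attempt-aware id gets through.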

I believe I've fixed these issues. I'll send a GitHub PR shortly.

This message was sent by Atlassian JIRA
