From Shay Seng <>
Subject persist before or after checkpoint?
Date Wed, 24 Sep 2014 19:26:44 GMT

I actually have 2 questions.

(1) I want to generate unique IDs for each RDD element, and I want to
assign them in parallel, so I do:

rdd.mapPartitionsWithIndex((index, s) => {
      var count = 0L
      s.zipWithIndex.map {
        case (t, i) => {
          count += 1
          (index * GLOBAL.MAX_PARTITION_SIZE + i, t)
        }
      }
})

This works OK, but we noticed that unless we checkpoint, the IDs get
messed up if a partition is recomputed.
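
To make the problem concrete, here is a small sketch in plain Scala (no
Spark; each inner Seq stands in for one partition). The names `IdSketch`,
`assignIds` and `MaxPartitionSize` are made up for illustration, and the
reordering on recomputation is my assumption about what is going on, not
something I have confirmed:

```scala
// Plain-Scala sketch: no Spark involved, partitions are simulated with Seqs.
object IdSketch {
  // Hypothetical stand-in for GLOBAL.MAX_PARTITION_SIZE.
  val MaxPartitionSize: Long = 1000L

  // Same arithmetic as the snippet above:
  // partition offset plus the element's position inside its partition.
  def assignIds[T](partitions: Seq[Seq[T]]): Seq[(Long, T)] =
    partitions.zipWithIndex.flatMap { case (part, index) =>
      part.zipWithIndex.map { case (t, i) =>
        (index.toLong * MaxPartitionSize + i, t)
      }
    }

  def main(args: Array[String]): Unit = {
    val firstRun = assignIds(Seq(Seq("a", "b"), Seq("c", "d")))
    // Simulated recomputation (assumption): partition 1 yields its
    // elements in a different order the second time around.
    val recomputed = assignIds(Seq(Seq("a", "b"), Seq("d", "c")))

    println(firstRun)
    println(recomputed)
    // IDs are unique within each run, but "c" and "d" swap IDs between runs.
    println(firstRun.toMap == recomputed.toMap) // prints false
  }
}
```

So the IDs are unique within a single computation, but they are only
stable if every recomputation of a partition sees the same elements in
the same order.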

Question 1: is there a better way to create unique IDs in a distributed way?

(2) To solve the stability issue with (1), we did:


The Spark logs suggested that checkpointed RDDs be persisted. Should the
persist come before or after checkpointing?

OK, I lied, I have 3 questions.
(3) We are checkpointing to HDFS. We've noticed that sometimes the
checkpointing works and I see /RDD-1 etc. written in HDFS, but other times
we only see the checkpoint dir created and no data. I suspect this is
related to (2), but I'm not certain what is really happening.

Any pointers would be appreciated.
I'm using AWS r3.4xlarge machines with Spark 0.9.2


