I use an RDD checkpoint before writing to Mongo to avoid duplicate records in the DB. It seems the same data is written twice when a task fails:
- data is calculated
- a Mongo _id is created for each record
- the Spark Mongo connector writes the data to Mongo
- the task crashes
- (BOOM!) Spark recomputes the partition and generates new _id values for the Mongo records
- I end up with duplicate records in Mongo
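To make the sequence concrete, here is a minimal plain-Python simulation of it. It does not use Spark or the Mongo connector: `compute_partition`, `write_to_store` and the `store` dict are hypothetical stand-ins, and the random `uuid4` id is only an assumption about how the _id might be generated.

```python
import uuid

def compute_partition(rows):
    # Stand-in for the Spark stage: every (re)computation assigns fresh random ids.
    return [{"_id": uuid.uuid4().hex, "value": v} for v in rows]

def write_to_store(store, records):
    # Stand-in for the connector write: documents are keyed by _id,
    # so records with new ids become new documents.
    for r in records:
        store[r["_id"]] = r["value"]

def run_task_with_one_failure(store, rows):
    # Attempt 1: compute and write, then the task crashes after the write.
    write_to_store(store, compute_partition(rows))
    # Attempt 2 (retry): the partition is recomputed, so the ids are different.
    write_to_store(store, compute_partition(rows))

store = {}
run_task_with_one_failure(store, ["a", "b", "c"])
print(len(store), "documents stored for 3 input rows")
```

Three input rows end up as six documents, because the retry cannot tell that its records were already written under different ids.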
So I added a checkpoint before the write to Mongo.
Now the job takes twice as long to run because of the checkpoint.
What is the right way to avoid this? I'm thinking of saving the data to HDFS, then reading it back and writing it to Mongo instead of using a checkpoint.
Is that a viable idea?
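A rough sketch of the staging idea, again in plain Python rather than Spark: a local temp file stands in for HDFS and a dict stands in for the Mongo collection. All function names are hypothetical. The point is only that once the records, including their _id, are persisted, a retried write reads the same ids back instead of generating new ones.

```python
import json
import os
import tempfile
import uuid

def compute_records(rows):
    # Ids are generated exactly once, before staging.
    return [{"_id": uuid.uuid4().hex, "value": v} for v in rows]

def stage(records, path):
    # Stand-in for saving to HDFS: one JSON document per line.
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def load_staged(path):
    # Stand-in for reading the staged data back.
    with open(path) as f:
        return [json.loads(line) for line in f]

def upsert(store, records):
    # Write keyed by _id: writing the same record twice leaves one document.
    for r in records:
        store[r["_id"]] = r["value"]

staging_path = os.path.join(tempfile.mkdtemp(), "staged.jsonl")
stage(compute_records(["a", "b", "c"]), staging_path)

mongo = {}
upsert(mongo, load_staged(staging_path))  # first attempt, assume the task crashes after it
upsert(mongo, load_staged(staging_path))  # retry re-reads the same ids
print(len(mongo), "documents stored for 3 input rows")
```

This only stays duplicate-free if the write is keyed by _id. How the real connector treats a document whose _id already exists depends on its write configuration, which this sketch does not model.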