spark-user mailing list archives

From Ivan Petrov <>
Subject Is there any possibility to avoid double computation in case of RDD checkpointing
Date Sun, 16 Aug 2020 22:45:39 GMT
I use an RDD checkpoint before writing to MongoDB to avoid duplicate records in
the DB. Without it, Spark seems to write the same data twice when a task fails:
- the data is calculated
- a Mongo _id is generated for each record
- the Spark Mongo connector writes the data to Mongo
- a task crashes
- (BOOM!) Spark recomputes the partition and generates new _id values for the records
- I end up with duplicate records in Mongo
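
The failure mode above can be sketched as follows. This is a hypothetical, minimal example (names, paths, and the local master are assumptions, and the Mongo write is elided): checkpointing materializes the RDD so that a retried downstream task replays the stored partitions instead of recomputing them and re-generating random _id values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: RDD checkpoint before an external write.
// Without the checkpoint, a retried write task would recompute the
// partition and UUID.randomUUID() would yield different _id values.
object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]"))
    // in production this should point at reliable storage such as HDFS
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // non-deterministic _id generation, as in the scenario above
    val records = sc.parallelize(1 to 5)
      .map(i => (java.util.UUID.randomUUID().toString, i))

    records.checkpoint()
    records.count() // an action triggers computation and materializes the checkpoint

    // subsequent actions (e.g. the Mongo connector write) read the
    // checkpointed partitions, so the _id values stay stable across retries
    sc.stop()
  }
}
```

Note that plain `checkpoint()` computes the RDD once for the triggering job and again to write the checkpoint files, which matches the doubled runtime described below.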

So I added a checkpoint before writing to Mongo.
Now Spark's execution runtime has doubled because of the checkpoint.
What is the right way to avoid this? I'm thinking of saving the data to HDFS,
then reading it back and writing it to Mongo, instead of using a checkpoint...
Is that a viable idea?
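
The proposed alternative could look roughly like this (a hypothetical sketch; the staging path, schema, and column names are assumptions, and the actual Mongo connector write is left as a comment). The point is that a retried Mongo write task re-reads the same Parquet files, so the _id values cannot change between attempts:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: stage the computed data on HDFS, then read it back
// and write the now-stable copy to Mongo.
object WriteViaHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-staging-sketch").getOrCreate()
    import spark.implicits._

    val stagingPath = "hdfs:///tmp/mongo-staging/run-001" // assumed path

    // _id generated once, non-deterministically, as in the original scenario
    val computed = Seq(
      (java.util.UUID.randomUUID().toString, 1),
      (java.util.UUID.randomUUID().toString, 2)
    ).toDF("_id", "value")

    computed.write.mode("overwrite").parquet(stagingPath) // materialize exactly once

    val stable = spark.read.parquet(stagingPath)
    // stable.write.format(...).save(...)  // connector write; _ids are now fixed on disk
    spark.stop()
  }
}
```

Compared with `checkpoint()`, this trades the extra recomputation for one extra HDFS write/read pass, and the staged files can be inspected or replayed if the Mongo write fails partway.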
