spark-user mailing list archives

From Adrian Tanase <>
Subject FW: Spark streaming - failed recovery from checkpoint
Date Mon, 02 Nov 2015 15:40:40 GMT
Re-posting here, as I didn’t get any feedback on the dev list.

Has anyone experienced corrupted checkpoints recently?


From: Adrian Tanase
Date: Thursday, October 29, 2015 at 1:38 PM
To: "<>"
Subject: Spark streaming - failed recovery from checkpoint

Hi guys,

I’ve encountered some problems with a crashed Spark Streaming job, when restoring from checkpoint.
I’m running Spark 1.5.1 on YARN (Hadoop 2.6) in cluster mode, reading from Kafka with the
direct consumer and a few updateStateByKey stateful transformations.
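For reference, the job is shaped roughly like this (a minimal sketch against the Spark 1.5
streaming API; the checkpoint directory, broker list, topic and state function are
placeholders, not my actual code):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object CheckpointedJob {
  val checkpointDir = "hdfs:///checkpoints/my-job" // placeholder

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-job")
    val ssc = new StreamingContext(conf, Seconds(20)) // 20-second batches
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events")) // placeholder topic

    // One of several stateful transformations; its state is checkpointed too.
    val counts = stream
      .map { case (_, v) => (v, 1L) }
      .updateStateByKey[Long]((values, state) =>
        Some(state.getOrElse(0L) + values.sum))
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```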

After investigating, I think the following happened:

  *   The active ResourceManager crashed (the AWS machine went down)
  *   10 minutes later (default YARN settings :() the standby took over and redeployed the
job, sending a SIGTERM to the running driver
  *   Recovery from checkpoint failed because of missing RDD in checkpoint folder

One complication (UNCONFIRMED because of missing logs): I believe the new driver was started
~5 minutes before the old one stopped.

With your help, I’m trying to zero in on a root cause, or a combination of:

  *   Bad YARN/Spark configuration (10 minutes to react to a missing node; already fixed
through more aggressive liveness settings)
  *   YARN fact of life: why is a running job redeployed when the standby RM takes over?
  *   Bug/race condition in Spark checkpoint cleanup/recovery? (Why is the RDD cleaned up
by the old app, and why does recovery then fail when it looks for it?)
  *   Bugs in the YARN-Spark integration (missing heartbeats? Why is the new app started 5
minutes before the old one dies?)
  *   Application code: should we add graceful shutdown? Should I add a Zookeeper lock that
prevents 2 instances of the driver from starting at the same time?
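For reference, the “more aggressive liveness settings” I mentioned are along these lines
(example values only, not a recommendation; the 10-minute defaults are what made the RM take
so long to react to the missing node):

```xml
<!-- yarn-site.xml: liveness knobs we tightened. Defaults shown in comments. -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>60000</value> <!-- default 600000 (10 min) -->
</property>
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>60000</value> <!-- default 600000 (10 min) -->
</property>
```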

Sorry if the questions are a little all over the place; getting to the root cause of this
was a pain, and I can’t even log an issue in Jira without your help.

Attaching some logs that showcase the checkpoint recovery failure (I’ve grepped for “checkpoint”
to highlight the core issue):

  *   Driver logs prior to shutdown:
  *   Driver logs, failed recovery:
  *   Other info:
     *   spark.streaming.unpersist = true
     *   spark.cleaner.ttl = 259200 (3 days)

Last question: in the checkpoint recovery process I notice that it goes back ~6 minutes
on the persisted RDDs and ~10 minutes to replay from Kafka.
I’m running with 20-second batches and a 100-second checkpoint interval (small issue: one
of the RDDs was using the default interval of 20 secs). Shouldn’t the lineage be a lot
smaller? Based on the documentation I would have expected recovery to go back at most 100
seconds, as I’m not doing any windowed operations…
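For completeness, this is how I’m setting the checkpoint interval on the stateful streams
(sketch; stateStream stands in for each updateStateByKey output — the one I missed was left
at the default and so checkpointed every batch):

```scala
// Explicit checkpoint interval on a stateful DStream; without this, the
// mandatory checkpointing falls back to a default tied to the batch interval.
stateStream.checkpoint(Seconds(100))
```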

Thanks in advance!