[Adding the dev list since this is probably a bug, but I'm not sure how to reproduce it in order to open a bug report.]

Hi,

I have a standalone Spark 1.4.0 cluster with hundreds of applications running every day.

From time to time, applications crash with the error shown below.
At the same time (and also afterwards), other applications keep running, so I can safely assume the master and workers are working.

1. Why is there a NullPointerException? (I can't trace the Scala stack trace back to the code, but an NPE is usually an obvious bug, even if the underlying cause is a network error.)
2. Why can't it connect to the master? (If it's a network timeout, how can I increase it? I see the values are hardcoded inside AppClient.)
3. How can I recover from this error?
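
One possible workaround for (3), assuming the failure is transient and that the SparkContext constructor throws (as the IllegalStateException in the log below suggests), is to retry context creation from the driver. This is only a sketch: the class name RetryingStarter, the method startWithRetry, the attempt count, and the delay are hypothetical placeholders and not part of Spark. It does not address the NPE or the hardcoded timeouts in AppClient.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of Spark): retries a factory call a fixed number of times.
public class RetryingStarter {

    // Calls factory.get() up to maxAttempts times, sleeping delayMillis between
    // failed attempts. Rethrows the last failure if every attempt fails.
    public static <T> T startWithRetry(Supplier<T> factory, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw last;
    }
}
```

In the driver this might be called as startWithRetry(() -> new JavaSparkContext(conf), 3, 10000L), where conf is the application's SparkConf and the numbers are arbitrary.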


  ERROR 01-11 15:32:54,991    SparkDeploySchedulerBackend - Application has been killed. Reason: All masters are unresponsive! Giving up.
  ERROR 01-11 15:32:55,087              OneForOneStrategy -
  java.lang.NullPointerException
      at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
      at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
      at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
      at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
      at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
      at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
      at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
      at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
      at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
      at akka.actor.ActorCell.invoke(ActorCell.scala:487)
      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
      at akka.dispatch.Mailbox.run(Mailbox.scala:220)
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  ERROR 01-11 15:32:55,603                   SparkContext - Error initializing SparkContext.
  java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
      at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
      at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
      at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
      at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
      at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)

Thanks!

Romi Kuntsman, Big Data Engineer