spark-user mailing list archives

From Romi Kuntsman <r...@totango.com>
Subject Re: Some spark apps fail with "All masters are unresponsive", while others pass normally
Date Mon, 09 Nov 2015 16:30:10 GMT
I didn't see anything about an OOM.
This sometimes happens before the application has done anything, and it
happens to a few applications at the same time, so I guess it's a
communication failure. The problem is that the error shown doesn't
represent the actual problem (which may be a network timeout, etc.).
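
On question 3 in the quoted message (how to recover): if the failure really is a transient communication problem, one possible workaround is to retry context creation on the driver side with a growing delay. The sketch below is only an illustration under that assumption. `StartupRetry` and `withRetries` are hypothetical names, not part of Spark, and the attempt count and delay are arbitrary. It does not address the NPE in AppClient or the hardcoded timeouts.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not a Spark API): retries a startup action that may fail transiently. */
final class StartupRetry {
    private StartupRetry() {}

    /**
     * Runs the action up to maxAttempts times, doubling the delay after each failure.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T withRetries(Supplier<T> action, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the final attempt
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delayMs *= 2; // simple exponential backoff
            }
        }
        throw last;
    }
}
```

In an application this might be called as `StartupRetry.withRetries(() -> new JavaSparkContext(conf), 3, 5000L)`, where `conf` is the application's existing `SparkConf`. Whether a failed attempt leaves anything behind that needs cleanup in 1.4.0 is not something this thread establishes, so treat it as a starting point only.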

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> Did you find anything regarding the OOM in the executor logs?
>
> Thanks
> Best Regards
>
> On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman <romi@totango.com> wrote:
>
>> If they have a problem managing memory, wouldn't there be an OOM?
>> Why does AppClient throw an NPE?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <akhil@sigmoidanalytics.com>
>> wrote:
>>
>>> Is that all you have in the executor logs? I suspect some of those jobs
>>> are having a hard time managing the memory.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <romi@totango.com> wrote:
>>>
>>>> [adding dev list since it's probably a bug, but I'm not sure how to
>>>> reproduce it so I can open a bug about it]
>>>>
>>>> Hi,
>>>>
>>>> I have a standalone Spark 1.4.0 cluster with 100s of applications
>>>> running every day.
>>>>
>>>> From time to time, the applications crash with the following error (see
>>>> below).
>>>> But at the same time (and also after that), other applications are
>>>> running, so I can safely assume the master and workers are working.
>>>>
>>>> 1. Why is there a NullPointerException? (I can't trace the Scala stack
>>>> trace to the code, but an NPE is usually an obvious bug even if there's
>>>> actually a network error...)
>>>> 2. Why can't it connect to the master? (If it's a network timeout, how
>>>> do I increase it? I see the values are hardcoded inside AppClient.)
>>>> 3. How do I recover from this error?
>>>>
>>>>
>>>>   ERROR 01-11 15:32:54,991    SparkDeploySchedulerBackend - Application
>>>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>>>>   ERROR 01-11 15:32:55,087              OneForOneStrategy - ERROR
>>>> logs/error.log
>>>>   java.lang.NullPointerException NullPointerException
>>>>       at
>>>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>>>       at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>       at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>       at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>       at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>>       at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>>       at
>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>>       at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>>       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>>       at
>>>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>>>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>       at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>       at
>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>>>       at
>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>       at
>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>       at
>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>       at
>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>   ERROR 01-11 15:32:55,603                   SparkContext - Error
>>>> initializing SparkContext. ERROR
>>>>   java.lang.IllegalStateException: Cannot call methods on a stopped
>>>> SparkContext
>>>>       at
>>>> org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>>       at
>>>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>>>       at
>>>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>>>       at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
>>>>       at
>>>> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> *Romi Kuntsman*, *Big Data Engineer*
>>>> http://www.totango.com
>>>>
>>>
>>>
>>
>
