spark-user mailing list archives

From Marcelo Vanzin <van...@cloudera.com>
Subject Re: Initial job has not accepted any resources
Date Thu, 07 Aug 2014 16:17:02 GMT
There are two things that could be happening here:

- You're requesting more resources than the master has available, so
your executors never start. Given your description, that doesn't seem
to be the case here.
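The first check is just arithmetic: does some worker have enough free cores and memory to satisfy one executor's request? A minimal sketch using the numbers from the report below (5 workers offering 6 cores / 32 GB each, executors asking for 16 GB; the assumption that a standalone executor claims all of a worker's offered cores is the historical default, not something stated in the thread):

```python
# Sanity check of the resource math: a request is schedulable only if
# some registered worker can satisfy both the core and memory ask.
workers = [{"cores": 6, "mem_gb": 32}] * 5  # 5 standalone workers as described below

executor_cores = 6    # assumed: standalone executors take all of a worker's offered cores
executor_mem_gb = 16  # from --executor-memory 16g in the report below

fits = any(w["cores"] >= executor_cores and w["mem_gb"] >= executor_mem_gb
           for w in workers)
print(fits)  # True: the ask fits, so resource starvation is not the likely cause
```

When this prints True, the warning is misleading and the second scenario (executors dying or failing to connect back) is the one to chase.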

- The executors are starting, but are failing to connect back to the
driver. In that case, you should see errors in each executor's log
file.
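On a standalone cluster those logs live on each worker machine, by default under SPARK_HOME/work/&lt;app-id&gt;/&lt;executor-id&gt;/stdout and stderr (the default layout is an assumption here; a custom SPARK_WORKER_DIR changes the root). A small sketch that builds a mock work/ tree in a temp dir and scans it the same way you would scan the real one:

```python
# Sketch: on a standalone worker, each executor writes its logs to
# SPARK_HOME/work/<app-id>/<executor-id>/{stdout,stderr}. Build a mock
# layout and collect every stderr file, as you would on a real worker.
import pathlib
import tempfile

work = pathlib.Path(tempfile.mkdtemp()) / "work"   # stands in for SPARK_HOME/work
exec_dir = work / "app-20140807171444-0002" / "4"  # app/executor ids from the log below
exec_dir.mkdir(parents=True)
(exec_dir / "stderr").write_text("Command exited with code 1\n")

# These are the files to read for the actual connection errors:
stderr_logs = sorted(work.glob("*/*/stderr"))
for log in stderr_logs:
    print(log, "->", log.read_text().strip())
```

The "Command exited with code 1" lines in the driver log below mean the real stderr files on the workers contain the actual failure (often a bind/connect error back to the driver's host or port).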


On Thu, Aug 7, 2014 at 9:11 AM, arnaudbriche <briche.arnaud@gmail.com> wrote:
> Hi,
>
> I'm trying a simple thing: creating an RDD from a text file (~3GB) located on
> GlusterFS, which is mounted on all Spark cluster machines, and calling
> rdd.count(); but Spark never manages to complete the job, repeatedly logging
> messages like the following: WARN TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
>
> I run a standalone Spark cluster with 1 master node and 5 worker nodes;
> the workers are 12-core, 64GB machines, and I allocated 6 cores and 32GB to
> each Spark slave (1 per slave machine).
>
> I run spark-shell with the following command: spark-shell --master
> --driver-cores 6 --executor-memory 16g
>
> Following is my Spark shell session:
>
> scala> val f = sc.textFile("/mnt/backups/stats.json")
> 14/08/07 17:15:05 INFO MemoryStore: ensureFreeSpace(138763) called with
> curMem=0, maxMem=309225062
> 14/08/07 17:15:05 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 135.5 KB, free 294.8 MB)
> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
> <console>:12
>
> scala> f.count()
> 14/08/07 17:15:18 INFO FileInputFormat: Total input paths to process : 1
> 14/08/07 17:15:18 INFO SparkContext: Starting job: count at <console>:15
> 14/08/07 17:15:18 INFO DAGScheduler: Got job 0 (count at <console>:15) with
> 38 output partitions (allowLocal=false)
> 14/08/07 17:15:18 INFO DAGScheduler: Final stage: Stage 0(count at
> <console>:15)
> 14/08/07 17:15:18 INFO DAGScheduler: Parents of final stage: List()
> 14/08/07 17:15:18 INFO DAGScheduler: Missing parents: List()
> 14/08/07 17:15:18 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at
> textFile at <console>:12), which has no missing parents
> 14/08/07 17:15:18 INFO DAGScheduler: Submitting 38 missing tasks from Stage
> 0 (MappedRDD[1] at textFile at <console>:12)
> 14/08/07 17:15:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 38 tasks
> 14/08/07 17:15:33 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:15:48 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:16:03 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:16:18 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/4 is now EXITED (Command exited with code 1)
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/4 removed: Command exited with code 1
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/5 on worker-20140807155724-172.18.31.153-7778
> (172.18.31.153:7778) with 6 cores
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/5 on hostPort 172.18.31.153:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/0 is now EXITED (Command exited with code 1)
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/0 removed: Command exited with code 1
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/6 on worker-20140807155724-172.22.56.186-7778
> (172.22.56.186:7778) with 6 cores
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/6 on hostPort 172.22.56.186:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/1 is now EXITED (Command exited with code 1)
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/1 removed: Command exited with code 1
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/7 on worker-20140807155724-172.28.173.218-7778
> (172.28.173.218:7778) with 6 cores
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/7 on hostPort 172.28.173.218:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/5 is now RUNNING
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/6 is now RUNNING
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/3 is now EXITED (Command exited with code 1)
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/3 removed: Command exited with code 1
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/8 on worker-20140807155724-172.23.64.98-7778
> (172.23.64.98:7778) with 6 cores
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/8 on hostPort 172.23.64.98:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/7 is now RUNNING
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/8 is now RUNNING
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/2 is now EXITED (Command exited with code 1)
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/2 removed: Command exited with code 1
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/9 on worker-20140807155724-172.29.166.84-7778
> (172.29.166.84:7778) with 6 cores
> 14/08/07 17:16:26 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/9 on hostPort 172.29.166.84:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:16:26 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/9 is now RUNNING
> 14/08/07 17:16:33 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:16:48 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:17:03 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:17:18 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:17:33 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:17:48 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:18:03 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/5 is now EXITED (Command exited with code 1)
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/5 removed: Command exited with code 1
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/10 on worker-20140807155724-172.18.31.153-7778
> (172.18.31.153:7778) with 6 cores
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/10 on hostPort 172.18.31.153:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/7 is now EXITED (Command exited with code 1)
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/7 removed: Command exited with code 1
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/11 on worker-20140807155724-172.28.173.218-7778
> (172.28.173.218:7778) with 6 cores
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/11 on hostPort 172.28.173.218:7778 with 6 cores,
> 16.0 GB RAM
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/6 is now EXITED (Command exited with code 1)
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/6 removed: Command exited with code 1
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/12 on worker-20140807155724-172.22.56.186-7778
> (172.22.56.186:7778) with 6 cores
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/12 on hostPort 172.22.56.186:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/10 is now RUNNING
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/11 is now RUNNING
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/12 is now RUNNING
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/8 is now EXITED (Command exited with code 1)
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/8 removed: Command exited with code 1
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor added:
> app-20140807171444-0002/13 on worker-20140807155724-172.23.64.98-7778
> (172.23.64.98:7778) with 6 cores
> 14/08/07 17:18:07 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140807171444-0002/13 on hostPort 172.23.64.98:7778 with 6 cores, 16.0
> GB RAM
> 14/08/07 17:18:07 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/13 is now RUNNING
> 14/08/07 17:18:08 INFO AppClient$ClientActor: Executor updated:
> app-20140807171444-0002/9 is now EXITED (Command exited with code 1)
> 14/08/07 17:18:08 INFO SparkDeploySchedulerBackend: Executor
> app-20140807171444-0002/9 removed: Command exited with code 1
> 14/08/07 17:18:08 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: Master removed our application: FAILED
> 14/08/07 17:18:08 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
> 14/08/07 17:18:08 INFO TaskSchedulerImpl: Cancelling stage 0
> 14/08/07 17:18:08 INFO DAGScheduler: Failed to run count at <console>:15
> 14/08/07 17:18:08 INFO SparkUI: Stopped Spark web UI at
> http://redis-1-prod.adyoulike.net:4040
> 14/08/07 17:18:08 INFO DAGScheduler: Stopping DAGScheduler
> 14/08/07 17:18:08 INFO SparkDeploySchedulerBackend: Shutting down all
> executors
> 14/08/07 17:18:08 INFO SparkDeploySchedulerBackend: Asking each executor to
> shut down
> org.apache.spark.SparkException: Job aborted due to stage failure: Master
> removed our application: FAILED
>     at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>     at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>     at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>     at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>     at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>     at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>     at scala.Option.foreach(Option.scala:236)
>     at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>     at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>     at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> scala> 14/08/07 17:18:18 WARN TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
>
>
> scala>
>
> scala> 14/08/07 17:18:33 WARN TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
> 14/08/07 17:18:38 INFO AppClient: Stop request to Master timed out; it may
> already be shut down.
> 14/08/07 17:18:39 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor
> stopped!
> 14/08/07 17:18:39 INFO ConnectionManager: Selector thread was interrupted!
> 14/08/07 17:18:39 INFO ConnectionManager: ConnectionManager stopped
> 14/08/07 17:18:39 INFO MemoryStore: MemoryStore cleared
> 14/08/07 17:18:39 INFO BlockManager: BlockManager stopped
> 14/08/07 17:18:39 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
> 14/08/07 17:18:39 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/08/07 17:18:39 INFO SparkContext: Successfully stopped SparkContext
> 14/08/07 17:18:39 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 14/08/07 17:18:39 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
> 14/08/07 17:18:39 INFO Remoting: Remoting shut down
> 14/08/07 17:18:39 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
> shut down.
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp11668.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>



-- 
Marcelo


