spark-user mailing list archives

From Jay Vyas <jayunit100.apa...@gmail.com>
Subject Re: Error with spark-submit (formatting corrected)
Date Fri, 18 Jul 2014 02:20:55 GMT
I think I know what is happening to you. I looked into this just this week, so
it's fresh in my mind :) Hope this helps.


When no workers are known to the master, iirc, you get this message.

I think this is how it works:

1) You start your master.
2) You start a slave and give it the master URL as an argument.
3) The slave then binds to a random port.
4) The slave then does a handshake with the master, which you can see in the slave logs (it says
something like "successfully connected to master at …").
  Actually, I think the master also logs that it is now aware of a slave running on ip:port…

So in your case, I suspect none of the slaves has connected to the master, so the job sits
idle.

This is similar to the YARN scenario of submitting a job to a ResourceManager with no NodeManagers
running.
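
The reasoning above can be sketched as a small check: given a snapshot of what the master knows about its workers, decide whether any executor could be placed at all. This is only an illustration, not Spark's actual scheduling code. The function name `can_place_executor` and the dictionary keys (`workers`, `state`, `cores_free`, `memory_free_mb`) are assumptions made up for this example; you would fill them in by hand from the numbers shown on the master's Web UI.

```python
def can_place_executor(master_state, executor_memory_mb):
    """Return (ok, reason) for a snapshot of the master's view of the cluster.

    master_state is a hypothetical dict (not a Spark API), for example:
    {"workers": [{"state": "ALIVE", "cores_free": 3, "memory_free_mb": 5120}]}
    """
    workers = master_state.get("workers", [])

    # Only workers that completed the handshake and are still alive count.
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    if not alive:
        return False, "no live workers are registered with the master"

    # A worker is usable if it has at least one free core and enough free memory.
    usable = [
        w for w in alive
        if w.get("cores_free", 0) >= 1
        and w.get("memory_free_mb", 0) >= executor_memory_mb
    ]
    if not usable:
        return False, "workers are registered, but none has a free core and enough free memory"

    return True, "%d worker(s) could host an executor" % len(usable)


if __name__ == "__main__":
    # The situation suspected above: the master is up but no slave has registered.
    print(can_place_executor({"workers": []}, 1024))
```

If the check fails for the first reason, the slave logs are the place to look for the handshake; if it fails for the second, the requested executor memory or cores are the likelier culprit.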



On Jul 17, 2014, at 6:57 PM, ranjanp <piyush_ranjan@hotmail.com> wrote:

> Hi, 
> I am new to Spark and trying out with a stand-alone, 3-node (1 master, 2
> workers) cluster. 
> 
> From the Web UI at the master, I see that the workers are registered. But
> when I try running the SparkPi example from the master node, I get the
> following message and then an exception. 
> 
> 14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master
> spark://10.1.3.7:7077... 
> 14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory 
> 
> I searched a bit for the above warning and found that others have
> encountered this problem before, but I did not see a clear resolution except
> for this link:
> http://apache-spark-user-list.1001560.n3.nabble.com/TaskSchedulerImpl-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-that-woy-tt8247.html#a8444
> 
> Based on the suggestion there I tried supplying --executor-memory option to
> spark-submit but that did not help. 
> 
> Any suggestions? Here are the details of my setup.
> - 3 nodes (each with 4 CPU cores and 7 GB memory) 
> - 1 node configured as Master, and the other two configured as workers 
> - Firewall is disabled on all nodes, and network communication between the
> nodes is not a problem 
> - Edited the conf/spark-env.sh on all nodes to set the following: 
>  SPARK_WORKER_CORES=3 
>  SPARK_WORKER_MEMORY=5G 
> - The Web UI as well as logs on master show that Workers were able to
> register correctly. Also the Web UI correctly shows the aggregate available
> memory and CPU cores on the workers: 
> 
> URL: spark://vmsparkwin1:7077
> Workers: 2
> Cores: 6 Total, 0 Used
> Memory: 10.0 GB Total, 0.0 B Used
> Applications: 0 Running, 0 Completed
> Drivers: 0 Running, 0 Completed
> Status: ALIVE
> 
> I tried running the SparkPi example first using run-example (which was
> failing) and later directly using spark-submit, as shown below:
> 
> $ export MASTER=spark://vmsparkwin1:7077
> 
> $ echo $MASTER
> spark://vmsparkwin1:7077
> 
> azureuser@vmsparkwin1 /cygdrive/c/opt/spark-1.0.0
> $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> spark://10.1.3.7:7077 --executor-memory 1G --total-executor-cores 2
> ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10
> 
> 
> The following is the full screen output:
> 
> 14/07/17 01:20:13 INFO SecurityManager: Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 14/07/17 01:20:13 INFO SecurityManager: Changing view acls to: azureuser
> 14/07/17 01:20:13 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(azureuser)
> 14/07/17 01:20:14 INFO Slf4jLogger: Slf4jLogger started
> 14/07/17 01:20:14 INFO Remoting: Starting remoting
> 14/07/17 01:20:14 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
> 14/07/17 01:20:14 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
> 14/07/17 01:20:14 INFO SparkEnv: Registering MapOutputTracker
> 14/07/17 01:20:14 INFO SparkEnv: Registering BlockManagerMaster
> 14/07/17 01:20:14 INFO DiskBlockManager: Created local directory at
> C:\cygwin\tmp\spark-local-20140717012014-b606
> 14/07/17 01:20:14 INFO MemoryStore: MemoryStore started with capacity 294.9
> MB.
> 14/07/17 01:20:14 INFO ConnectionManager: Bound socket to port 49842 with id
> = ConnectionManagerId(vmsparkwin1.cssparkwin.b1.internal.cloudapp.net,49842)
> 14/07/17 01:20:14 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/17 01:20:14 INFO BlockManagerInfo: Registering block manager
> vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49842 with 294.9 MB RAM
> 14/07/17 01:20:14 INFO BlockManagerMaster: Registered BlockManager
> 14/07/17 01:20:14 INFO HttpServer: Starting HTTP Server
> 14/07/17 01:20:14 INFO HttpBroadcast: Broadcast server started at
> http://10.1.3.7:49843
> 14/07/17 01:20:14 INFO HttpFileServer: HTTP File server directory is
> C:\cygwin\tmp\spark-6a076e92-53bb-4c7a-9e27-ce53a818146d
> 14/07/17 01:20:14 INFO HttpServer: Starting HTTP Server
> 14/07/17 01:20:15 INFO SparkUI: Started SparkUI at
> http://vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:4040
> 14/07/17 01:20:15 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/07/17 01:20:16 INFO SparkContext: Added JAR
> file:/C:/opt/spark-1.0.0/./lib/spark-examples-1.0.0-hadoop2.2.0.jar at
> http://10.1.3.7:49844/jars/spark-examples-1.0.0-hadoop2.2.0.jar with
> timestamp 1405560016316
> 14/07/17 01:20:16 INFO AppClient$ClientActor: Connecting to master
> spark://10.1.3.7:7077...
> 14/07/17 01:20:16 INFO SparkContext: Starting job: reduce at
> SparkPi.scala:35
> 14/07/17 01:20:16 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35)
> with 10 output partitions (allowLocal=false)
> 14/07/17 01:20:16 INFO DAGScheduler: Final stage: Stage 0(reduce at
> SparkPi.scala:35)
> 14/07/17 01:20:16 INFO DAGScheduler: Parents of final stage: List()
> 14/07/17 01:20:16 INFO DAGScheduler: Missing parents: List()
> 14/07/17 01:20:16 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map
> at SparkPi.scala:31), which has no missing parents
> 14/07/17 01:20:16 INFO DAGScheduler: Submitting 10 missing tasks from Stage
> 0 (MappedRDD[1] at map at SparkPi.scala:31)
> 14/07/17 01:20:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
> 14/07/17 01:20:31 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master
> spark://10.1.3.7:7077...
> 14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/07/17 01:20:56 INFO AppClient$ClientActor: Connecting to master
> spark://10.1.3.7:7077...
> 14/07/17 01:21:01 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
> 14/07/17 01:21:16 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: All masters are unresponsive! Giving up.
> 14/07/17 01:21:16 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
> 14/07/17 01:21:16 INFO TaskSchedulerImpl: Cancelling stage 0
> 14/07/17 01:21:16 INFO DAGScheduler: Failed to run reduce at
> SparkPi.scala:35
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: All masters are unresponsive! Giving up.
>        at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>        at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>        at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>        at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>        at scala.Option.foreach(Option.scala:236)
>        at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
>        at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
>        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>        at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>        at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>        at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>        at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>        at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-with-spark-submit-formatting-corrected-tp10102.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

