That error typically means there is a communication problem (wrong ports) between the master and the worker. Also check that the worker has write permissions to create the "work" directory. We were getting this error for one of those two reasons.
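
If you want to rule out the permissions issue quickly, a rough check from a Scala shell on the worker machine is something like this (a sketch only; the path is just an example based on the sparkHome in your conf, adjust it to your install):

  // Quick sanity check: can the user running the worker create the work dir?
  import java.io.File
  val workDir = new File("/opt/spark/work")  // example path, not necessarily yours
  println(s"work dir usable: ${workDir.isDirectory || workDir.mkdirs()}")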



On Tue, Jun 17, 2014 at 10:04 AM, Luis Ángel Vicente Sánchez <langel.groups@gmail.com> wrote:
I have been able to submit a job successfully, but I had to configure my Spark job this way:

  import org.apache.spark.SparkConf

  val sparkConf: SparkConf =
    new SparkConf()
      .setAppName("TwitterPopularTags")
      .setMaster("spark://int-spark-master:7077")
      .setSparkHome("/opt/spark")
      .setJars(Seq("/tmp/spark-test-0.1-SNAPSHOT.jar"))
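
For anyone reproducing this, the rest of a TwitterPopularTags-style job is wired up roughly like this (a sketch only; the 2-second batch interval and the empty filter list are my assumptions, not necessarily what my modified example uses):

  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.twitter.TwitterUtils

  // Sketch: feed the conf above into the streaming context and open the Twitter stream.
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val stream = TwitterUtils.createStream(ssc, None)

  ssc.start()
  ssc.awaitTermination()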

Now I'm getting this error on my worker:

14/06/17 17:03:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
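
From what I've read, that warning usually means either that no worker has registered with the master, or that the job is asking for more cores or memory than the workers advertise. Capping the request at the conf level would look roughly like this (a sketch; the values are placeholders, not a recommendation):

  sparkConf
    .set("spark.cores.max", "2")            // don't ask for more cores than the cluster offers
    .set("spark.executor.memory", "512m")   // keep executor memory within what the workers advertise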




2014-06-17 17:36 GMT+01:00 Luis Ángel Vicente Sánchez <langel.groups@gmail.com>:

Ok... I was checking the wrong version of that file yesterday. My worker is sending a DriverStateChanged(_, DriverState.FAILED, _), but there is no case branch for that state, so the worker is crashing. I still don't know why I'm getting a FAILED state, but I'm sure that is what kills the actor, via a scala.MatchError.
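
For context, the handler in Worker.scala matches on the driver state roughly like this (paraphrased from what I see in the 1.0.0 source, so treat it as a sketch rather than the exact code):

  case DriverStateChanged(driverId, state, exception) =>
    state match {
      case DriverState.ERROR    => logWarning(s"Driver $driverId failed with unrecoverable exception: ${exception.get}")
      case DriverState.FINISHED => logInfo(s"Driver $driverId exited successfully")
      case DriverState.KILLED   => logInfo(s"Driver $driverId was killed by user")
      // no branch for DriverState.FAILED, so that state escapes as a scala.MatchError
    }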

In Scala it is usually considered best practice to use a sealed trait with case classes/objects in a match statement instead of an enumeration (the compiler will warn about missing cases); I think that code should be refactored to catch this kind of error at compile time. See the sketch below for what I mean.
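
A minimal illustration of the difference (the names here are made up, not the real Spark types):

  // Enumeration version: the compiler cannot tell the match below is incomplete.
  object StateEnum extends Enumeration {
    val Running, Finished, Killed, Failed = Value
  }

  def describeEnum(s: StateEnum.Value): String = s match {
    case StateEnum.Running  => "running"
    case StateEnum.Finished => "finished"
    // Killed and Failed are missing: compiles fine, scala.MatchError at runtime.
  }

  // Sealed trait version: the compiler warns about every missing case.
  sealed trait State
  case object Running  extends State
  case object Finished extends State
  case object Killed   extends State
  case object Failed   extends State

  def describe(s: State): String = s match {
    case Running  => "running"
    case Finished => "finished"
    // Killed and Failed missing: "match may not be exhaustive" warning at compile time.
  }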

Now I need to find out why that state-changed message is sent in the first place... I will keep updating this thread until I find the problem :D


2014-06-16 18:25 GMT+01:00 Luis Ángel Vicente Sánchez <langel.groups@gmail.com>:

I'm playing with a modified version of the TwitterPopularTags example, and when I try to submit the job to my cluster, the workers keep dying with this message:

14/06/16 17:11:16 INFO DriverRunner: Launch Command: "java" "-cp" "/opt/spark-1.0.0-bin-hadoop1/work/driver-20140616171115-0014/spark-test-0.1-SNAPSHOT.jar:::/opt/spark-1.0.0-bin-hadoop1/conf:/opt/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@int-spark-worker:51676/user/Worker" "org.apache.spark.examples.streaming.TwitterPopularTags"
14/06/16 17:11:17 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:317)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/16 17:11:17 INFO Worker: Starting Spark worker int-spark-app-ie005d6a3.mclabs.io:51676 with 2 cores, 6.5 GB RAM
14/06/16 17:11:17 INFO Worker: Spark home: /opt/spark-1.0.0-bin-hadoop1
14/06/16 17:11:17 INFO WorkerWebUI: Started WorkerWebUI at http://int-spark-app-ie005d6a3.mclabs.io:8081
14/06/16 17:11:17 INFO Worker: Connecting to master spark://int-spark-app-ie005d6a3.mclabs.io:7077...
14/06/16 17:11:17 ERROR Worker: Worker registration failed: Attempted to re-register worker at same address: akka.tcp://sparkWorker@int-spark-app-ie005d6a3.mclabs.io:51676

This happens when the worker receives a DriverStateChanged(driverId, state, exception) message.

To deploy the job I copied the jar file to the temporary folder of the master node and executed the following command:

./spark-submit \
--class org.apache.spark.examples.streaming.TwitterPopularTags \
--master spark://int-spark-master:7077 \
--deploy-mode cluster \
file:///tmp/spark-test-0.1-SNAPSHOT.jar

I don't really know what the problem could be, as there is a 'case _' that should avoid that problem :S





--
Software Engineer
Analytics Engineering Team @ Box
Mountain View, CA