Alonso,

 

The CDH VM uses YARN and the default deploy mode is client. I’ve been able to use the CDH VM for many learning scenarios.
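For example, rather than pointing the application at the standalone master, you can let it run on YARN in client mode. A minimal sketch (assuming the quickstart VM's default Hadoop/YARN client configuration is on the classpath) would be:

import org.apache.spark.{SparkConf, SparkContext}

// "yarn-client" (Spark 1.6 syntax) runs the driver locally and asks YARN for executors,
// so no standalone master or worker needs to be running at all.
val sparkConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("yarn-client")
val sc = new SparkContext(sparkConf)

Leaving setMaster() out of the code and passing --master yarn --deploy-mode client to spark-submit is usually the more flexible option, since the same jar can then run locally, on YARN, or against a standalone master without recompiling.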

 

 

http://www.cloudera.com/documentation/enterprise/latest.html

http://www.cloudera.com/documentation/enterprise/latest/topics/spark.html

 

David Newberger

 

From: Alonso [mailto:alonsoir@gmail.com]
Sent: Friday, June 3, 2016 5:39 AM
To: user@spark.apache.org
Subject: About a problem running a spark job in a cdh-5.7.0 vmware image.

 

Hi, I am developing a project that needs to use Kafka, Spark Streaming and Spark MLlib; this is the GitHub project.

 

I am using a VMware CDH 5.7.0 image with 4 cores and 8 GB of RAM, and the file I want to process is only 16 MB, yet I seem to be running into a resource problem, because the process outputs this message:


16/06/03 11:58:09 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

 

 

When I go to the Spark master page, I can see this:

 

 

Spark Master at spark://192.168.30.137:7077

 

    URL: spark://192.168.30.137:7077

    REST URL: spark://192.168.30.137:6066 (cluster mode)

    Alive Workers: 0

    Cores in use: 0 Total, 0 Used

    Memory in use: 0.0 B Total, 0.0 B Used

    Applications: 2 Running, 0 Completed

    Drivers: 0 Running, 0 Completed

    Status: ALIVE

 

Workers

    Worker Id    Address    State    Cores    Memory
    (none -- no workers registered)

Running Applications

    Application ID           Name                  Cores  Memory per Node  Submitted Time       User      State    Duration
    app-20160603115752-0001  AmazonKafkaConnector  0      1024.0 MB        2016/06/03 11:57:52  cloudera  WAITING  2.0 min
    app-20160603115751-0000  AmazonKafkaConnector  0      1024.0 MB        2016/06/03 11:57:51  cloudera  WAITING  2.0 min

 

 

And this is the Spark worker page:

 

Spark Worker at 192.168.30.137:7078

 

    ID: worker-20160603115937-192.168.30.137-7078

    Master URL:

    Cores: 4 (0 Used)

    Memory: 6.7 GB (0.0 B Used)

 


Running Executors (0)

    ExecutorID    Cores    State    Memory    Job Details    Logs
    (none -- no running executors)

 

It is weird, isn't it? The Master URL is not set, and there are no executors, cores, and so on...

 

If I run ps xa | grep spark, this is the output:

 

[cloudera@quickstart bin]$ ps xa | grep spark

 6330 ?        Sl     0:11 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master

 

 6674 ?        Sl     0:12 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer

 

 8153 pts/1    Sl+    0:14 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /home/cloudera/awesome-recommendation-engine/target/pack/lib/* -Dprog.home=/home/cloudera/awesome-recommendation-engine/target/pack -Dprog.version=1.0-SNAPSHOT example.spark.AmazonKafkaConnector 192.168.1.35:9092 amazonRatingsTopic

 

 8413 ?        Sl     0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077

 

 8619 pts/3    S+     0:00 grep spark

 

The master is set up with four cores and 1 GB, and the worker has no dedicated cores and is also using 1 GB; that is weird, isn't it? I have configured the VMware image with 4 cores (out of eight) and 8 GB (out of 16).

 

This is how my build.sbt looks:

 

libraryDependencies ++= Seq(

  "org.apache.kafka" % "kafka_2.10" % "0.8.1"

      exclude("javax.jms", "jms")

      exclude("com.sun.jdmk", "jmxtools")

      exclude("com.sun.jmx", "jmxri"),

   //not working play module!! check

   //jdbc,

   //anorm,

   //cache,

   // HTTP client

   "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",

   // HTML parser

   "org.jodd" % "jodd-lagarto" % "3.5.2",

   "com.typesafe" % "config" % "1.2.1",

   "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",

   "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",

   "org.twitter4j" % "twitter4j-core" % "4.0.2",

   "org.twitter4j" % "twitter4j-stream" % "4.0.2",

   "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",

   "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",

   "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0-cdh5.7.0",

   "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.0",

   "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.0",

   "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.0",

   "org.apache.spark" % "spark-mllib_2.10" % "1.6.0-cdh5.7.0",

   "com.google.code.gson" % "gson" % "2.6.2",

   "commons-cli" % "commons-cli" % "1.3.1",

   "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",

   // Akka

   "com.typesafe.akka" %% "akka-actor" % akkaVersion,

   "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,

   // MongoDB

   "org.reactivemongo" %% "reactivemongo" % "0.10.0"

)

 

packAutoSettings

 

As you can see, I am using the exact versions of the Spark modules that ship with the pseudo-cluster, and I want to use sbt-pack in order to create a Unix command. This is how I am declaring the Spark context programmatically:

 

 

val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
                                   //.setMaster("local[4]")
                                   .setMaster("spark://192.168.30.137:7077")
                                   .set("spark.cores.max", "2")
                                   .set("spark.driver.allowMultipleContexts", "true")

 

...

 

val ratingFile= "hdfs://192.168.30.137:8020/user/cloudera/ratings.csv"

 

 

println("Using this ratingFile: " + ratingFile)

  // first create an RDD out of the rating file

  val rawTrainingRatings = sc.textFile(ratingFile).map {

    line =>

      val Array(userId, productId, scoreStr) = line.split(",")

      AmazonRating(userId, productId, scoreStr.toDouble)

  }
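For context, AmazonRating is just a small case class holding one parsed record; its definition is not shown here, but from the usage above it is along these lines (the name of the score field is an assumption):

// One rating parsed from a "userId,productId,score" CSV line
case class AmazonRating(userId: String, productId: String, rating: Double)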

 

  // only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products

 

 

// THIS IS THE LINE THAT PROVOKES the WARN TaskSchedulerImpl message!

   

val trainingRatings = rawTrainingRatings.groupBy(_.userId)

                                          .filter(r => MinRecommendationsPerUser <= r._2.size  && r._2.size < MaxRecommendationsPerUser)

                                          .flatMap(_._2)

                                          .repartition(NumPartitions)

                                          .cache()

 

  println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")

 

My question is: do you see anything wrong with the code? Is there anything terribly wrong that I have to change? And what can I do to get this up and running with my resources?

 

What most annoys me is that the above code works perfectly in the Spark shell of the virtual image, but when I try to run it programmatically, through the Unix command created with sbt-pack, it does not work.
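For completeness, the Unix command is produced by the sbt-pack plugin, which generates launcher scripts under target/pack (this matches the -Dprog.home path in the ps output above). Roughly, with the plugin version and the script name being assumptions:

// project/plugins.sbt -- enables the pack task / packAutoSettings used in build.sbt
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.8.0")

// Then, from the project root:
//   sbt pack
// which generates target/pack/bin/<launcher-script> wrapping the main class, launched as:
//   target/pack/bin/<launcher-script> 192.168.1.35:9092 amazonRatingsTopic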

 

If the dedicated resources are too few to develop this project, what else can I do? I mean, do I need to rent a small cluster on AWS or some other provider? If that is the right answer, what are your recommendations?

Thank you very much for reading this far.

 

Regards,

 

Alonso 

 

 

