spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Nist <tsind...@gmail.com>
Subject Re: Set EXTRA_JAR environment variable for spark-jobserver
Date Tue, 06 Jan 2015 18:52:17 GMT
*@Sasi*

You should be able to create a job something like this:

package io.radtech.spark.jobserver

import java.util.UUID
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.rdd.RDD
import org.joda.time.DateTime
import com.datastax.spark.connector.types.TypeConverter
import com.datastax.spark.connector.types.TypeConversionException
import com.typesafe.config.Config

case class AnalyticReport(
  deviceId: UUID,
  reportType: String,
  timestamp: DateTime,
  data: Map[String, String])

class ReadWriteCassandraJob {
}

trait AlycsReportSparkJob extends spark.jobserver.SparkJob with
    spark.jobserver.NamedRddSupport {

  val rddName = "report"

  // Validation is not really needed in this example
  def validate(sc: SparkContext, config: Config):
spark.jobserver.SparkJobValidation = spark.jobserver.SparkJobValid
}

object ReadWriteCassandraJob extends AlycsReportSparkJob {

  val cassandraHost = "127.0.0.1"
  val keyspace = "test"
  val table = "alycs_reports_by_device"

  /*
   * Enable Cassandra-specific functions on the `SparkContext` and `RDD`:
   */
  import com.datastax.spark.connector._

  /*
   * Before creating the `SparkContext`, set the
`spark.cassandra.connection.host`
   * property to the address of one of the Cassandra nodes.
   */
  val conf = new SparkConf(true).set("spark.cassandra.connection.host",
cassandraHost)

  /*
   * Set the port to connect to.  If using embedded instance set to 9142
else
   * default to 9042.
   */
  conf.set("spark.cassandra.connection.native.port", "9042")


  override def runJob(sc: SparkContext, config: Config) = {
    // Read table test.alycs_reports_by_device and print its contents:

    val rdd = sc.cassandraTable(keyspace, table).select(
      "device_id", "report_type", "time", "data")

    rdd.collect().foreach(println)

    val rddrows = rdd.map(r =>
      AnalyticReport(UUID.fromString(r.getUUID("device_id").toString()),
        r.getString("report_type"),
        new org.joda.time.DateTime(r.getDate("time")),
        r.getMap[String, String]("data")))

    rddrows.collect.foreach(println)

  }

}

Then create a custom spark context file,
src/main/resources/spark.context-settings.config, for the job; note the
versions of the jars are incorrect below, don't have the latest ones off
the top of my head.  If you are using the uber / fat jar from
spark-cassandra-connector then simple place that here instead, i believe
the name is:  spark-cassandra-connector-assembly-1.1.0.jar.

spark.context-settings {
    spark.cores.max = 4
    spark.cassandra.connection.host 127.0.0.1
    dependent-jar-uris = [
      "local://sparkshell-lib/spark-cassandra-connector_2.10-1.0.0-rc4.jar",
      "local://sparkshell-lib/cassandra-clientutil-2.0.9.jar",
      "local://sparkshell-lib/cassandra-thrift-2.0.9.jar",
      "local://sparkshell-lib/cassandra-driver-core-2.0.4.jar",
      "local://sparkshell-lib/guava-15.0.jar",
      "local://sparkshell-lib/libthrift-0.9.1.jar",
      "local://sparkshell-lib/joda-convert-1.2.jar",
      "local://sparkshell-lib/joda-time-2.3.jar"
    ]
}

Now post the context to the job server:

radtech:spark-jobserver-example$ curl -d
src/main/resources/spark.context-settings.config -X POST
'localhost:8090/contexts/cassJob-context'



Then execute your job:

curl --data-binary
@target/scala-2.10/spark-jobserver-example_2.10-0.1.0.jar
localhost:8090/jars/cassjob
curl -X POST 'localhost:8090/jobs?appName=cassjob&classPath=io.radtech.spark.jobserver.ReadWriteCassandraJob&context=cassJob-context'

Worse case you should be able to set these in your spark-defatul.conf to a
location that is common to all your executors:

spark.executor.extraClassPath=.....

HTH.

-Todd

On Tue, Jan 6, 2015 at 10:00 AM, bchazalet <bchazalet@companywatch.net>
wrote:

> It does not look like you're supposed to fiddle with the SparkConf and even
> SparkContext in a 'job' (again, I don't know much about jobserver), as
> you're given a SparkContext as parameter in the build method.
>
> I guess jobserver initialises the SparkConf and SparkContext itself when it
> first starts, meanwhile you're actually creating a new one within your job,
> which the github example you mentionned doesn't do, it just uses the
> context
> given as parameter:
>
> def build(sc: SparkContext): RDD[(Reputation, User)] = {
>     sc.textFile(inputPath).
>       map(User.fromRow).
>       collect {
>         case Some(user) => user.reputation -> user
>       }.
>       sortByKey(ascending = false)
>   }
>
> I am not sure either how you upload your job's jar to the server (the curl
> command you posted does not seem to do so).
>
> Maybe you could try first to make it work on its own as a regular spark
> app,
> without using jobserver.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Set-EXTRA-JAR-environment-variable-for-spark-jobserver-tp20989p20998.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Mime
View raw message