Gotcha. The easiest way to get your dependencies to your Executors is probably to construct your SparkContext with all the necessary jars passed in (as the "jars" parameter), or to put them in a SparkConf with setJars(). Avro is one such necessary jar, but your application may need to distribute others to the cluster as well.
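
A minimal sketch of the setJars() route, for a standalone application rather than the shell. The master URL, application name, and jar paths are placeholders I made up, not values from your cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShipJars {
  // Hypothetical values: substitute your own master URL and jar locations.
  val master = "spark://master-host:7077"
  val jars = Seq("/path/to/avro.jar", "/path/to/my-app.jar")

  // Builds a SparkConf that lists every jar the Executors need.
  // setJars stores the list under the "spark.jars" key.
  def buildConf(master: String, jars: Seq[String]): SparkConf =
    new SparkConf()
      .setMaster(master)
      .setAppName("avro-example") // hypothetical application name
      .setJars(jars)

  def main(args: Array[String]): Unit = {
    // The jars listed in the conf are shipped to the Executors
    // when the context starts running jobs.
    val sc = new SparkContext(buildConf(master, jars))
    try {
      // ... build your RDDs and run your jobs here ...
    } finally {
      sc.stop()
    }
  }
}
```

If you already have a running context, sc.addJar(path) is another way to register a jar for later tasks.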

An easy way to make sure all your dependencies get shipped to the cluster is to build an assembly jar of your application, which bundles all of your application's transitive dependencies. Then you only need to tell Spark about that one jar. Maven and sbt both have pretty straightforward ways of producing assembly jars.
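
For the sbt route, a rough sketch using the sbt-assembly plugin might look like the following. The project name and all version numbers here are illustrative assumptions, so match them to your sbt release and to the Spark and Avro versions your CDH install actually ships:

```scala
// --- project/plugins.sbt ---
// Plugin version is an assumption; pick one that matches your sbt release.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// --- build.sbt ---
name := "my-spark-app" // hypothetical project name

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the assembly jar,
  // since the cluster already has it on the classpath.
  "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided",
  // Avro and its transitive dependencies end up inside the assembly jar.
  "org.apache.avro" % "avro" % "1.7.6"
)
```

Running "sbt assembly" should then produce a single jar under target/scala-2.10/, and that is the only path you would need to hand to Spark.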


On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jurney@gmail.com> wrote:
Thanks for the fast reply.

I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode.


On Saturday, May 31, 2014, Aaron Davidson <ilikerps@gmail.com> wrote:
The first issue was caused by your cluster being configured incorrectly. You could probably read 1 file because that was done on the driver node, but when it tried to run a job on the cluster, it failed.

Second issue, it seems that the jar containing avro is not getting propagated to the Executors. What version of Spark are you running on? What deployment mode (YARN, standalone, Mesos)?


On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jurney@gmail.com> wrote:
Now I get this:

scala> rdd.first
14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N