spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: hadoopRDD stalls reading entire directory
Date Mon, 02 Jun 2014 21:57:14 GMT
Ah, I apologize! I didn't realize you were running from the spark-shell.
The shell has already created its own SparkContext, so you can just do

sc.addJar("avro-1.7.6.jar")
sc.addJar("avro-mapred-1.7.6.jar")

The previous instructions would have worked if you were running your own
Spark application where you control the creation of the SparkContext.
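
For reference, here is a rough sketch of the whole flow inside the existing
shell session, reusing the shell's sc rather than constructing a second
SparkContext (the jar paths are assumed to be relative to the directory the
shell was started from, and the input path is the one from your example):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Ship the Avro jars to the executors through the shell's own SparkContext.
sc.addJar("avro-1.7.6.jar")
sc.addJar("avro-mapred-1.7.6.jar")

// Build the JobConf against the shell's Hadoop configuration.
val jobConf = new JobConf(sc.hadoopConfiguration)
FileInputFormat.setInputPaths(jobConf,
  "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*.avro")

// Read the Avro records with the existing sc; no second SparkContext
// (and no second web UI on 4041) gets created.
val rdd = sc.hadoopRDD(
  jobConf,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable],
  1)

rdd.first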


On Mon, Jun 2, 2014 at 2:02 PM, Russell Jurney <russell.jurney@gmail.com>
wrote:

> Nothing appears to be running on hivecluster2:8080.
>
> 'sudo jps' does show
>
> [hivedata@hivecluster2 ~]$ sudo jps
> 9953 PepAgent
> 13797 JournalNode
> 7618 NameNode
> 6574 Jps
> 12716 Worker
> 16671 RunJar
> 18675 Main
> 18177 JobTracker
> 10918 Master
> 18139 TaskTracker
> 7674 DataNode
>
>
> I kill all processes listed. I restart Spark Master on hivecluster2:
>
> [hivedata@hivecluster2 ~]$ sudo
> /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-master.sh
>
> starting org.apache.spark.deploy.master.Master, logging to
> /var/log/spark/spark-root-org.apache.spark.deploy.master.Master-1-hivecluster2.out
>
> I run the spark shell again:
>
> [hivedata@hivecluster2 ~]$ spark-shell -usejavacp -classpath "*.jar"
> 14/06/02 13:52:13 INFO spark.HttpServer: Starting HTTP Server
> 14/06/02 13:52:13 INFO server.Server: jetty-7.6.8.v20121106
> 14/06/02 13:52:13 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:52814
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 0.9.0
>       /_/
>
> Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.6.0_31)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 14/06/02 13:52:19 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 14/06/02 13:52:19 INFO Remoting: Starting remoting
> 14/06/02 13:52:19 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@hivecluster2:46033]
> 14/06/02 13:52:19 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@hivecluster2:46033]
> 14/06/02 13:52:19 INFO spark.SparkEnv: Registering BlockManagerMaster
> 14/06/02 13:52:19 INFO storage.DiskBlockManager: Created local directory
> at /tmp/spark-local-20140602135219-bd8a
> 14/06/02 13:52:19 INFO storage.MemoryStore: MemoryStore started with
> capacity 294.4 MB.
> 14/06/02 13:52:19 INFO network.ConnectionManager: Bound socket to port
> 50645 with id = ConnectionManagerId(hivecluster2,50645)
> 14/06/02 13:52:19 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 14/06/02 13:52:19 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Registering block manager hivecluster2:50645 with 294.4 MB RAM
> 14/06/02 13:52:19 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/06/02 13:52:19 INFO spark.HttpServer: Starting HTTP Server
> 14/06/02 13:52:19 INFO server.Server: jetty-7.6.8.v20121106
> 14/06/02 13:52:19 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:36103
> 14/06/02 13:52:19 INFO broadcast.HttpBroadcast: Broadcast server started
> at http://10.10.30.211:36103
> 14/06/02 13:52:19 INFO spark.SparkEnv: Registering MapOutputTracker
> 14/06/02 13:52:19 INFO spark.HttpFileServer: HTTP File server directory is
> /tmp/spark-ecce4c62-fef6-4369-a3d5-e3d7cbd1e00c
> 14/06/02 13:52:19 INFO spark.HttpServer: Starting HTTP Server
> 14/06/02 13:52:19 INFO server.Server: jetty-7.6.8.v20121106
> 14/06/02 13:52:19 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:37662
> 14/06/02 13:52:19 INFO server.Server: jetty-7.6.8.v20121106
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage/rdd,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/stage,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/pool,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/environment,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/executors,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/metrics/json,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/static,null}
> 14/06/02 13:52:19 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/,null}
> 14/06/02 13:52:19 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 14/06/02 13:52:19 INFO ui.SparkUI: Started Spark Web UI at http://hivecluster2:4040
> 14/06/02 13:52:19 INFO client.AppClient$ClientActor: Connecting to master
> spark://hivecluster2:7077...
> 14/06/02 13:52:20 INFO cluster.SparkDeploySchedulerBackend: Connected to
> Spark cluster with app ID app-20140602135220-0000
> Created spark context..
> Spark context available as sc.
>
>
> Note that the Spark Web UI is running at hivecluster2:4040; I get the UI
> when I go there. I verify again that nothing exists at hivecluster2:8080.
>
> I try to run my code:
>
> ...
>
> val sparkConf = new SparkConf()
> sparkConf.setMaster("spark://hivecluster2:7077")
> sparkConf.setAppName("Test Spark App")
> sparkConf.setJars(Array("avro-1.7.6.jar", "avro-mapred-1.7.6.jar"))
> val sc = new SparkContext(sparkConf)
>
> This produces a new Spark web UI(!) at port 4041:
>
>
> 14/06/02 13:55:31 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4041
> 14/06/02 13:55:31 INFO ui.SparkUI: Started Spark Web UI at
> http://hivecluster2:4041
> 14/06/02 13:55:31 INFO spark.SparkContext: Added JAR avro-1.7.6.jar at
> http://10.10.30.211:49845/jars/avro-1.7.6.jar with timestamp 1401742531616
>  14/06/02 13:55:31 INFO spark.SparkContext: Added JAR
> avro-mapred-1.7.6.jar at
> http://10.10.30.211:49845/jars/avro-mapred-1.7.6.jar with timestamp
> 1401742531617
> 14/06/02 13:55:31 INFO client.AppClient$ClientActor: Connecting to master
> spark://hivecluster2:7077...
> 14/06/02 13:55:31 INFO cluster.SparkDeploySchedulerBackend: Connected to
> Spark cluster with app ID app-20140602135531-0001
> sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2e9329e9
>
>
> I run the rest of my code...
>
> val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*.avro" // or "part-m-000{15,16}.avro"
>
> val jobConf = new JobConf(sc.hadoopConfiguration)
> jobConf.setJobName("Test Scala Job")
> FileInputFormat.setInputPaths(jobConf, input)
>
> val rdd = sc.hadoopRDD(
>   //confBroadcast.value.value,
>   jobConf,
>   classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
>   classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
>   classOf[org.apache.hadoop.io.NullWritable],
>   1)
>
> val f1 = rdd.first
>
>
> I get this:
>
> 14/06/02 14:00:36 INFO mapred.FileInputFormat: Total input paths to
> process : 17
> 14/06/02 14:00:36 INFO spark.SparkContext: Starting job: first at
> <console>:47
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Got job 0 (first at
> <console>:47) with 1 output partitions (allowLocal=true)
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first
> at <console>:47)
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Parents of final stage:
> List()
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Missing parents: List()
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Computing the requested
> partition locally
> 14/06/02 14:00:36 INFO rdd.HadoopRDD: Input split:
> hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
> 14/06/02 14:00:36 INFO spark.SparkContext: Job finished: first at
> <console>:47, took 0.374416468 s
> 14/06/02 14:00:36 INFO spark.SparkContext: Starting job: first at
> <console>:47
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Got job 1 (first at
> <console>:47) with 16 output partitions (allowLocal=true)
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Final stage: Stage 1 (first
> at <console>:47)
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Parents of final stage:
> List()
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Missing parents: List()
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Submitting Stage 1
> (HadoopRDD[0] at hadoopRDD at <console>:45), which has no missing parents
> 14/06/02 14:00:36 INFO scheduler.DAGScheduler: Submitting 16 missing tasks
> from Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:45)
> 14/06/02 14:00:36 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
> with 16 tasks
> 14/06/02 14:00:51 WARN scheduler.TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
>
>
> I see my job at http://hivecluster2:4041, but not at hivecluster2:4040.
> Tasks: Succeeded/Total shows 0/16.
>
> How do I instantiate a new SparkContext without starting a second web UI
> (and application)? That seems to be the issue.
>
> Russ
>
>
> On Mon, Jun 2, 2014 at 1:19 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
>
>> You may have to do "sudo jps", because it should definitely list your
>> processes.
>>
>> What does hivecluster2:8080 look like? My guess is it says there are 2
>> applications registered, and one has taken all the executors. There must be
>> two applications running, as those are the only things that keep open those
>> 4040/4041 ports.
>>
>>
>> On Mon, Jun 2, 2014 at 11:32 AM, Russell Jurney <russell.jurney@gmail.com
>> > wrote:
>>
>>> If it matters, I have servers running at
>>> http://hivecluster2:4040/stages/ and http://hivecluster2:4041/stages/
>>>
>>> When I run rdd.first, I see an item at
>>> http://hivecluster2:4041/stages/ but no tasks are running. Stage ID 1,
>>> first at <console>:46, Tasks: Succeeded/Total 0/16.
>>>
>>> On Mon, Jun 2, 2014 at 10:09 AM, Russell Jurney
>>> <russell.jurney@gmail.com> wrote:
>>> > Looks like just worker and master processes are running:
>>> >
>>> > [hivedata@hivecluster2 ~]$ jps
>>> >
>>> > 10425 Jps
>>> >
>>> > [hivedata@hivecluster2 ~]$ ps aux|grep spark
>>> >
>>> > hivedata 10424  0.0  0.0 103248   820 pts/3    S+   10:05   0:00 grep spark
>>> >
>>> > root     10918  0.5  1.4 4752880 230512 ?      Sl   May27  41:43 java -cp
>>> > :/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/conf:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/core/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/repl/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/examples/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/bagel/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/mllib/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/streaming/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-mapreduce/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/scala-library.jar:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/scala-compiler.jar:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/jline.jar
>>> > -Dspark.akka.logLifecycleEvents=true
>>> > -Djava.library.path=/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
>>> > -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip hivecluster2
>>> > --port 7077 --webui-port 18080
>>> >
>>> > root     12715  0.0  0.0 148028   656 ?        S    May27   0:00 sudo
>>> > /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class
>>> > org.apache.spark.deploy.worker.Worker spark://hivecluster2:7077
>>> >
>>> > root     12716  0.3  1.1 4155884 191340 ?      Sl   May27  30:21 java -cp
>>> > :/opt/cloudera/parcels/SPARK/lib/spark/conf:/opt/cloudera/parcels/SPARK/lib/spark/core/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/repl/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/examples/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/bagel/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/mllib/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/streaming/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-mapreduce/*:/opt/cloudera/parcels/SPARK/lib/spark/lib/scala-library.jar:/opt/cloudera/parcels/SPARK/lib/spark/lib/scala-compiler.jar:/opt/cloudera/parcels/SPARK/lib/spark/lib/jline.jar
>>> > -Dspark.akka.logLifecycleEvents=true
>>> > -Djava.library.path=/opt/cloudera/parcels/SPARK/lib/spark/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
>>> > -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker
>>> > spark://hivecluster2:7077
>>> >
>>> >
>>> >
>>> >
>>> > On Sun, Jun 1, 2014 at 7:41 PM, Aaron Davidson <ilikerps@gmail.com>
>>> wrote:
>>> >>
>>> >> Sounds like you have two shells running, and the first one is taking all
>>> >> your resources. Do a "jps" and kill the other guy, then try again.
>>> >>
>>> >> By the way, you can look at http://localhost:8080 (replace localhost with
>>> >> the server your Spark Master is running on) to see what applications are
>>> >> currently started, and what resource allocations they have.
>>> >>
>>> >>
>>> >> On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney <
>>> russell.jurney@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Thanks again. Run results here:
>>> >>> https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
>>> >>>
>>> >>> This time I get a port already in use exception on 4040, but it isn't
>>> >>> fatal. Then when I run rdd.first, I get this over and over:
>>> >>>
>>> >>> 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not
>>> >>> accepted any resources; check your cluster UI to ensure that workers are
>>> >>> registered and have sufficient memory
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson <ilikerps@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> You can avoid that by using the constructor that takes a SparkConf, a la
>>> >>>>
>>> >>>> val conf = new SparkConf()
>>> >>>> conf.setJars(Seq("avro.jar", ...))
>>> >>>> val sc = new SparkContext(conf)
>>> >>>>
>>> >>>>
>>> >>>> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney
>>> >>>> <russell.jurney@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Followup question: the docs to make a new SparkContext require that I
>>> >>>>> know where $SPARK_HOME is. However, I have no idea. Any idea where
>>> >>>>> that might be?
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilikerps@gmail.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> Gotcha. The easiest way to get your dependencies to your Executors
>>> >>>>>> would probably be to construct your SparkContext with all necessary
>>> >>>>>> jars passed in (as the "jars" parameter), or inside a SparkConf with
>>> >>>>>> setJars(). Avro is a "necessary jar", but it's possible your
>>> >>>>>> application also needs to distribute other ones to the cluster.
>>> >>>>>>
>>> >>>>>> An easy way to make sure all your dependencies get shipped to the
>>> >>>>>> cluster is to create an assembly jar of your application, and then
>>> >>>>>> you just need to tell Spark about that jar, which includes all your
>>> >>>>>> application's transitive dependencies. Maven and sbt both have pretty
>>> >>>>>> straightforward ways of producing assembly jars.
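>>> >>>>>>
>>> >>>>>> As a rough illustration of that last point (the assembly jar path
>>> >>>>>> below is hypothetical and depends on how your build names its output):
>>> >>>>>>
>>> >>>>>> import org.apache.spark.{SparkConf, SparkContext}
>>> >>>>>>
>>> >>>>>> // Build a single fat jar (e.g. "sbt assembly", or the Maven shade
>>> >>>>>> // plugin) and ship just that jar instead of listing every dependency.
>>> >>>>>> val conf = new SparkConf()
>>> >>>>>>   .setMaster("spark://hivecluster2:7077")
>>> >>>>>>   .setAppName("Test Spark App")
>>> >>>>>>   .setJars(Seq("target/scala-2.10/myapp-assembly-0.1.jar"))
>>> >>>>>> val sc = new SparkContext(conf)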
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney
>>> >>>>>> <russell.jurney@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Thanks for the fast reply.
>>> >>>>>>>
>>> >>>>>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
>>> >>>>>>> standalone mode.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Saturday, May 31, 2014, Aaron Davidson <ilikerps@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> First issue was because your cluster was configured incorrectly. You
>>> >>>>>>>> could probably read 1 file because that was done on the driver node,
>>> >>>>>>>> but when it tried to run a job on the cluster, it failed.
>>> >>>>>>>>
>>> >>>>>>>> Second issue, it seems that the jar containing avro is not getting
>>> >>>>>>>> propagated to the Executors. What version of Spark are you running
>>> >>>>>>>> on? What deployment mode (YARN, standalone, Mesos)?
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney
>>> >>>>>>>> <russell.jurney@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Now I get this:
>>> >>>>>>>>
>>> >>>>>>>> scala> rdd.first
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>> >>>>>>>>
>>> >>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>> >>>>>>> datasyndrome.com
>>> >>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>> >>>>> datasyndrome.com
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>> >>> datasyndrome.com
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
>>>
>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>> datasyndrome.com
>>>
>>
>>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
>
