spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Hindman <benjamin.hind...@gmail.com>
Subject Re: Problem with running Spark over Mesos in fine-grained mode
Date Wed, 29 Jan 2014 05:20:49 GMT
Hey Marek,

Can you include the stdout and stderr of the executors that are running?
You should be able to get them from the Mesos master web UI or on the slave
directly. If they are empty then you'll need to set up the log4j properties.

Ben.


On Tue, Jan 28, 2014 at 7:07 AM, Marek Wiewiorka
<marek.wiewiorka@gmail.com>wrote:

> Hi All,
> I trying to run my app over Mesos in in fine-grained mode and I'm getting
> the attached errors(both Spark driver and Mesos slave).
> The same app in coarse-grained mode runs fine. The only thing I'm changing
> is setting the spark.mesos.coarse param to true/false.
> System.setProperty("spark.mesos.coarse", "false")
> If set to true everything is fine, if set to false(default)  errors below
> occur.
>
> I'm using:
> Mesos 0.16-rc3
> Spark 0.9.0-rc5
>
> I would appreciate any hints.
>
> Regards,
> Marek
>
>
>
> Spark driver
> =============
> 14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 14/01/28 14:51:08 INFO Remoting: Starting remoting
> 14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@XXXXXX:49899]
> 14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@XXXXXX:49899]
> 14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
> 14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory
> at /tmp/spark-local-20140128145108-e1d9
> 14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with
> capacity 589.2 MB.
> 14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port
> 34442 with id = ConnectionManagerId(XXXXXX,34442)
> 14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Registering block manager XXXXXX:34442 with 589.2 MB RAM
> 14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
> 14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
> 14/01/28 14:51:09 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:41485
> 14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started
> at http://10.0.0.8:41485
> 14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
> 14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is
> /tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
> 14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
> 14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
> 14/01/28 14:51:09 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:47676
> 14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage/rdd,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/storage,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/stage,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages/pool,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/stages,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/environment,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/executors,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/metrics/json,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/static,null}
> 14/01/28 14:51:09 INFO handler.ContextHandler: started
> o.e.j.s.h.ContextHandler{/,null}
> 14/01/28 14:51:09 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at
> http://XXXXXX:4040
> 14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called
> with curMem=0, maxMem=617820979
> 14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as
> values to memory (estimated size 90.8 KB, free 589.1 MB)
> 14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at
> SparkSeqSimpleJob.scala:48
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4
> (reduceByKey at SparkSeqAnalysis.scala:355)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at
> SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count
> at SparkSeqSimpleJob.scala:48)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage:
> List(Stage 1)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage
> 1)
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1
> (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which
> has no missing parents
> 14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks
> from Stage 1 (MapPartitionsRDD[4] at reduceByKey at
> SparkSeqAnalysis.scala:355)
> 14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
> with 24 tasks
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as
> TID 0 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as
> 2500 bytes in 13 ms
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as
> TID 1 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as
> 2500 bytes in 0 ms
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as
> TID 2 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as
> 2500 bytes in 1 ms
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as
> TID 3 on executor 201401281448-134217738-5050-17953-1: XXXXXX
> (PROCESS_LOCAL)
> 14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as
> 2500 bytes in 1 ms
> 14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for
> 201401281448-134217738-5050-17953-1 from TaskSet 1.0
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
> 14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
> 14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost:
> 201401281448-134217738-5050-17953-1 (epoch 0)
> 14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
> 14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed
> 201401281448-134217738-5050-17953-1 successfully in removeExecutor
> 1
>
> Mesos SLAVE LOG
> ============
> I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching
> 201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in
> /tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9
> with resources mem(*):10240' for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor
> 201401281448-134217738-5050-17953-1 of framework
> '201401281448-134217738-5050-17953-0000
> I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at
> 18386
> I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor
> 201401281448-134217738-5050-17953-1 of framework
> 201401281448-134217738-5050-17953-0000 forked at pid 18386
> I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for
> executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of
> terminated executor '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.428513 17992 slave.cpp:2145] Executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000 has exited with status 0
> I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
> task 0 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
> task 1 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the
> process tree rooted at pid 18386: Failed to find process 18386
> I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
> task 2 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update
> TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
> framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
> I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update
> resources for an unknown/killed executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status
> update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
> framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding
> status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
> task 3 of framework 201401281448-134217738-5050-17953-0000 to
> master@10.0.0.8:5050
> I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
> task 0 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
> task 1 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
> task 2 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status
> update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
> task 3 of framework 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor
> '201401281448-134217738-5050-17953-1' of framework
> 201401281448-134217738-5050-17953-0000
> I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework
> 201401281448-134217738-5050-17953-0000
>
>

Mime
View raw message