spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marek Wiewiorka <marek.wiewio...@gmail.com>
Subject Problem with running Spark over Mesos in fine-grained mode
Date Tue, 28 Jan 2014 15:07:56 GMT
Hi All,
I trying to run my app over Mesos in in fine-grained mode and I'm getting
the attached errors(both Spark driver and Mesos slave).
The same app in coarse-grained mode runs fine. The only thing I'm changing
is setting the spark.mesos.coarse param to true/false.
System.setProperty("spark.mesos.coarse", "false")
If set to true everything is fine, if set to false(default)  errors below
occur.

I'm using:
Mesos 0.16-rc3
Spark 0.9.0-rc5

I would appreciate any hints.

Regards,
Marek



Spark driver
=============
14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/01/28 14:51:08 INFO Remoting: Starting remoting
14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140128145108-e1d9
14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with
capacity 589.2 MB.
14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port
34442 with id = ConnectionManagerId(XXXXXX,34442)
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Registering block manager XXXXXX:34442 with 589.2 MB RAM
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:41485
14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started at
http://10.0.0.8:41485
14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:47676
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/stage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/pool,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/environment,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/executors,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/metrics/json,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/static,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/,null}
14/01/28 14:51:09 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at
http://XXXXXX:4040
14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called
with curMem=0, maxMem=617820979
14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 90.8 KB, free 589.1 MB)
14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process
: 1
14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at
SparkSeqSimpleJob.scala:48
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4
(reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at
SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count
at SparkSeqSimpleJob.scala:48)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage:
List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage
1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1
(MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which
has no missing parents
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks
from Stage 1 (MapPartitionsRDD[4] at reduceByKey at
SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
with 24 tasks
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID
0 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as
2500 bytes in 13 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID
1 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as
2500 bytes in 0 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as TID
2 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as
2500 bytes in 1 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as TID
3 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as
2500 bytes in 1 ms
14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for
201401281448-134217738-5050-17953-1 from TaskSet 1.0
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost:
201401281448-134217738-5050-17953-1 (epoch 0)
14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove
executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed
201401281448-134217738-5050-17953-1 successfully in removeExecutor
1

Mesos SLAVE LOG
============
I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching
201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in
/tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9
with resources mem(*):10240' for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework
201401281448-134217738-5050-17953-0000
I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor
201401281448-134217738-5050-17953-1 of framework
'201401281448-134217738-5050-17953-0000
I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at
18386
I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor
201401281448-134217738-5050-17953-1 of framework
201401281448-134217738-5050-17953-0000 forked at pid 18386
I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for
executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of
terminated executor '201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.428513 17992 slave.cpp:2145] Executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000 has exited with status 0
I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
task 0 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
task 1 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the
process tree rooted at pid 18386: Failed to find process 18386
I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
task 2 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update
TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update
resources for an unknown/killed executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status
update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of
framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding
status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
task 3 of framework 201401281448-134217738-5050-17953-0000 to
master@10.0.0.8:5050
I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for
task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for
task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for
task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status
update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for
task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor
'201401281448-134217738-5050-17953-1' of framework
201401281448-134217738-5050-17953-0000
I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework
201401281448-134217738-5050-17953-0000

Mime
View raw message