Hi,

I'm having trouble running Spark on Mesos in fine-grained mode. I'm running Spark 1.0.0 and Mesos 0.18.0. Tasks are failing randomly, which most of the time, but not always, causes the job to fail. The same code runs fine in coarse-grained mode. I see the following exceptions in the logs of the Spark driver:

W0617 10:57:36.774382  8735 sched.cpp:901] Attempting to launch task 21 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
W0617 10:57:36.774433  8735 sched.cpp:901] Attempting to launch task 22 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
14/06/17 10:57:36 INFO TaskSetManager: Re-queueing tasks for 201311011608-1369465866-5050-9189-46 from TaskSet 0.0
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 22 (task 0.0:2)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 19 (task 0.0:0)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 21 (task 0.0:1)
14/06/17 10:57:36 INFO DAGScheduler: Executor lost: 201311011608-1369465866-5050-9189-46 (epoch 0)
14/06/17 10:57:36 INFO BlockManagerMasterActor: Trying to remove executor 201311011608-1369465866-5050-9189-46 from BlockManagerMaster.
14/06/17 10:57:36 INFO BlockManagerMaster: Removed 201311011608-1369465866-5050-9189-46 successfully in removeExecutor
14/06/17 10:57:36 DEBUG MapOutputTrackerMaster: Increasing epoch to 1
14/06/17 10:57:36 INFO DAGScheduler: Host added was in lost list earlier: ca1-dcc1-0065.lab.mtl
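
For completeness, the job is set up roughly along the lines of the sketch below; switching between the two modes is just a matter of spark.mesos.coarse, and the master URL, executor URI, memory and app name here are placeholders rather than my exact settings:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: master URL, executor URI, memory and app name are placeholders.
val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")                      // or mesos://zk://.../mesos
  .setAppName("my-job")
  .set("spark.executor.uri", "hdfs:///tmp/spark-1.0.0.tar.gz") // placeholder
  .set("spark.executor.memory", "2g")                          // placeholder
  .set("spark.mesos.coarse", "false") // fine-grained (the default); "true" = coarse-grained

val sc = new SparkContext(conf)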

I don't see any exceptions in the Spark executor logs. The only errors I found in Mesos itself are warnings in the Mesos master:

W0617 10:57:36.816748 26100 master.cpp:1615] Failed to validate task 21 : Task 21 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered mem(*):3216; disk(*):98304; ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.819807 26100 master.cpp:1615] Failed to validate task 22 : Task 22 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered mem(*):3216; disk(*):98304; ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.932287 26102 master.cpp:1615] Failed to validate task 28 : Task 28 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered cpus(*):1; mem(*):3216; disk(*):98304; ports(*):[11900-11960, 11962-11978, 11980-11999]
W0617 11:05:52.783133 26098 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-46 on slave 201311011608-1369465866-5050-9189-46 (ca1-dcc1-0065.lab.mtl)
W0617 11:05:52.787739 26103 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-34 on slave 201311011608-1369465866-5050-9189-34 (ca1-dcc1-0053.lab.mtl)
W0617 11:05:52.790292 26102 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-59 on slave 201311011608-1369465866-5050-9189-59 (ca1-dcc1-0079.lab.mtl)
W0617 11:05:52.800649 26099 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-18 on slave 201311011608-1369465866-5050-9189-18 (ca1-dcc1-0027.lab.mtl)
... (more of those "Ignoring unknown exited executor")
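
The way I read those "Failed to validate" warnings, the driver is trying to launch a 1-CPU task against an offer whose single CPU is already accounted for (the "already used cpus(*):1; mem(*):2048"), so the combined resources no longer fit the offer. A toy restatement of that arithmetic, just to check that I'm reading it correctly (my interpretation, not actual Mesos code):

// Numbers taken from the "Failed to validate task 21" line above.
case class Res(cpus: Double, memMB: Double)

val offered     = Res(cpus = 1, memMB = 3216)   // "offered ... mem(*):3216 ... cpus(*):1"
val alreadyUsed = Res(cpus = 1, memMB = 2048)   // "already used cpus(*):1; mem(*):2048"
val task21      = Res(cpus = 1, memMB = 0)      // "attempted to use cpus(*):1"

val fits = (alreadyUsed.cpus  + task21.cpus  <= offered.cpus) &&
           (alreadyUsed.memMB + task21.memMB <= offered.memMB)
// fits == false: 1 + 1 CPUs against the 1 CPU offered, hence the validation failure.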


I analyzed the difference between the execution of the same job in coarse-grained mode and in fine-grained mode, and I noticed that in fine-grained mode the tasks get executed on different executors than the ones reported by Spark, as if Spark and Mesos get out of sync as to which executor is responsible for which task. See the following:


Coarse-grained mode:

Spark                                 | Mesos
Task Index | Task ID | Executor | Status  | Task ID (UI) | Task Name | Task ID (logs) | Executor | State
0          | 0       | 66       | SUCCESS | 4            | "Task 4"  | 0              | 66       | RUNNING
1          | 1       | 59       | SUCCESS | 0            | "Task 0"  | 1              | 59       | RUNNING
2          | 2       | 54       | SUCCESS | 10           | "Task 10" | 2              | 54       | RUNNING
3          | 3       | 128      | SUCCESS | 6            | "Task 6"  | 3              | 128      | RUNNING
...


Fine-grained mode:

Spark                                 | Mesos
Task Index | Task ID | Executor | Status  | Task ID (UI) | Task Name    | Task ID (logs) | Executor | State
0          | 23      | 108      | SUCCESS | 23           | "task 0.0:0" | 23             | 27       | FINISHED
0          | 19      | 65       | FAILED  | 19           | "task 0.0:0" | 19             | 86       | FINISHED
1          | 21      | 65       | FAILED  | Mesos executor was never created
1          | 24      | 92       | SUCCESS | 24           | "task 0.0:1" | 24             | 129      | FINISHED
2          | 22      | 65       | FAILED  | Mesos executor was never created
2          | 25      | 100      | SUCCESS | 25           | "task 0.0:2" | 25             | 84       | FINISHED
3          | 26      | 80       | SUCCESS | 26           | "task 0.0:3" | 26             | 124      | FINISHED
4          | 27      | 65       | FAILED  | 27           | "task 0.0:4" | 27             | 108      | FINISHED
4          | 29      | 92       | SUCCESS | 29           | "task 0.0:4" | 29             | 65       | FINISHED
5          | 28      | 65       | FAILED  | Mesos executor was never created
5          | 30      | 77       | SUCCESS | 30           | "task 0.0:5" | 30             | 62       | FINISHED
6          | 0       | 53       | SUCCESS | 0            | "task 0.0:6" | 0              | 41       | FINISHED
7          | 1       | 77       | SUCCESS | 1            | "task 0.0:7" | 1              | 114      | FINISHED
...
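
(For anyone who wants to reproduce the comparison: the Spark side is what the application UI reports, and the Mesos side can be dumped from the master with something along the lines of the sketch below. The state.json field names are my assumption about the 0.18.x layout and may need adjusting.)

import scala.io.Source
import scala.util.parsing.json.JSON

// Hypothetical helper: print task id / name / executor / state for every framework
// known to the Mesos master. Host and field names are assumptions, not verified.
object DumpMesosTasks {
  def main(args: Array[String]): Unit = {
    val url   = args.headOption.getOrElse("http://mesos-master:5050/master/state.json")
    val state = JSON.parseFull(Source.fromURL(url).mkString)
                    .get.asInstanceOf[Map[String, Any]]
    for {
      fw   <- state("frameworks").asInstanceOf[List[Map[String, Any]]]
      key  <- Seq("tasks", "completed_tasks")          // running and finished tasks
      task <- fw(key).asInstanceOf[List[Map[String, Any]]]
    } println(Seq(task("id"), task("name"), task("executor_id"), task("state")).mkString("\t"))
  }
}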


Is it normal for the executors reported by Spark and by Mesos to be different when running in fine-grained mode?

Please note that in this particular example the job actually succeeded, but most of the time a job fails after 4 failed attempts of a given task. Every job I run works in coarse-grained mode and fails this way in fine-grained mode; none of them ever fails in coarse-grained mode.
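
(I assume the "4 failed attempts" is simply Spark's default spark.task.maxFailures of 4; bumping it, as in the line below, only keeps a debugging run alive longer and obviously doesn't address the underlying problem.)

// Assumption: the default spark.task.maxFailures (4) is what ends the job after
// 4 attempts; raising it on the conf sketched earlier is only a debugging aid.
conf.set("spark.task.maxFailures", "8")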

Does anybody have an idea what the problem could be?

Thanks,

- Sebastien