spark-dev mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: task always lost
Date Thu, 03 Jul 2014 16:37:24 GMT
The issue you're seeing is not the same as the one you linked to -- your
serialized task sizes are very small, and Mesos fine-grained mode doesn't
use Akka anyway.

The error log you printed appears to come from the Mesos slave log
(slave.cpp), but do you happen to have the logs from the actual executors
themselves? These should be Spark logs, which hopefully show the actual
Exception (or lack thereof) before the executors die.

The tasks are dying very quickly, so this is probably either related to
your application logic throwing some sort of fatal JVM error or due to your
Mesos setup. I'm not sure if that "Failed to fetch URIs for container" is
fatal or not.
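
One thing that may help narrow it down: if the 32512 in that message is a
raw POSIX wait status (an assumption on my part, I have not checked how
Mesos produces the number), then the real exit code is in the high byte.
A minimal sketch of the decoding, where decode_wait_status is just an
illustrative helper name and not anything from Mesos or Spark:

```python
from typing import Tuple


def decode_wait_status(status: int) -> Tuple[int, int]:
    """Split a raw POSIX wait status into (exit_code, signal_number).

    Assumes the conventional encoding: the exit code sits in the high
    byte and the terminating signal (0 if none) in the low 7 bits.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


# The value reported in the "Failed to fetch URIs" message.
print(decode_wait_status(32512))  # (127, 0)
```

Exit code 127 is what a shell conventionally returns for "command not
found", so under that assumption it would be worth checking that whatever
command the slave uses to fetch the executor URI is actually on the PATH
of the slave process.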


On Wed, Jul 2, 2014 at 2:44 AM, qingyang li <liqingyang1985@gmail.com>
wrote:

> The executor is always being removed.
>
> Someone else encountered the same issue:
> https://groups.google.com/forum/#!topic/spark-users/-mYn6BF-Y5Y
>
> -------------
> 14/07/02 17:41:16 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
> 14/07/02 17:41:16 INFO storage.BlockManagerMaster: Removed
> 20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
> 14/07/02 17:41:16 DEBUG spark.MapOutputTrackerMaster: Increasing epoch to
> 10
> 14/07/02 17:41:16 INFO scheduler.DAGScheduler: Host gained which was in
> lost list earlier: bigdata001
> 14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name:
> TaskSet_0, runningTasks: 0
> 14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name:
> TaskSet_0, runningTasks: 0
> 14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID
> 12 on executor 20140616-143932-1694607552-5050-4080-3: bigdata004
> (NODE_LOCAL)
> 14/07/02 17:41:16 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as
> 10785 bytes in 1 ms
> 14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID
> 13 on executor 20140616-104524-1694607552-5050-26919-3: bigdata002
> (NODE_LOCAL
>
>
> 2014-07-02 12:01 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:
>
> > There is also this one in the warning log:
> >
> > E0702 11:35:08.869998 17840 slave.cpp:2310] Container
> > 'af557235-2d5f-4062-aaf3-a747cb3cd0d1' for executor
> > '20140616-104524-1694607552-5050-26919-1' of framework
> > '20140702-113428-1694607552-5050-17766-0000' failed to start: Failed to
> > fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit
> > status 32512
> >
> >
> > 2014-07-02 11:46 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:
> >
> > Here is the log:
> >>
> >> E0702 10:32:07.599364 14915 slave.cpp:2686] Failed to unmonitor
> container
> >> for executor 20140616-104524-1694607552-5050-26919-1 of framework
> >> 20140702-102939-1694607552-5050-14846-0000: Not monitored
> >>
> >>
> >> 2014-07-02 1:45 GMT+08:00 Aaron Davidson <ilikerps@gmail.com>:
> >>
> >> Can you post the logs from any of the dying executors?
> >>>
> >>>
> >>> On Tue, Jul 1, 2014 at 1:25 AM, qingyang li <liqingyang1985@gmail.com>
> >>> wrote:
> >>>
> >>> > I am using Mesos 0.19 and Spark 0.9.0. The Mesos cluster is started,
> >>> > but when I use spark-shell to submit a job, the tasks are always lost.
> >>> > Here is the log:
> >>> > ----------
> >>> > 14/07/01 16:24:27 INFO DAGScheduler: Host gained which was in lost
> list
> >>> > earlier: bigdata005
> >>> > 14/07/01 16:24:27 INFO TaskSetManager: Starting task 0.0:1 as TID
> 4042
> >>> on
> >>> > executor 20140616-143932-1694607552-5050-4080-2: bigdata005
> >>> (PROCESS_LOCAL)
> >>> > 14/07/01 16:24:27 INFO TaskSetManager: Serialized task 0.0:1 as 1570
> >>> bytes
> >>> > in 0 ms
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
> >>> > 20140616-104524-1694607552-5050-26919-1 from TaskSet 0.0
> >>> > 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4041 (task 0.0:0)
> >>> > 14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
> >>> > 20140616-104524-1694607552-5050-26919-1 (epoch 3427)
> >>> > 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove
> >>> executor
> >>> > 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
> >>> > 14/07/01 16:24:28 INFO BlockManagerMaster: Removed
> >>> > 20140616-104524-1694607552-5050-26919-1 successfully in
> removeExecutor
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
> >>> > 20140616-143932-1694607552-5050-4080-2 from TaskSet 0.0
> >>> > 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4042 (task 0.0:1)
> >>> > 14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
> >>> > 20140616-143932-1694607552-5050-4080-2 (epoch 3428)
> >>> > 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove
> >>> executor
> >>> > 20140616-143932-1694607552-5050-4080-2 from BlockManagerMaster.
> >>> > 14/07/01 16:24:28 INFO BlockManagerMaster: Removed
> >>> > 20140616-143932-1694607552-5050-4080-2 successfully in removeExecutor
> >>> > 14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost
> list
> >>> > earlier: bigdata005
> >>> > 14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost
> list
> >>> > earlier: bigdata001
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:1 as TID
> 4043
> >>> on
> >>> > executor 20140616-143932-1694607552-5050-4080-2: bigdata005
> >>> (PROCESS_LOCAL)
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:1 as 1570
> >>> bytes
> >>> > in 0 ms
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:0 as TID
> 4044
> >>> on
> >>> > executor 20140616-104524-1694607552-5050-26919-1: bigdata001
> >>> > (PROCESS_LOCAL)
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:0 as 1570
> >>> bytes
> >>> > in 0 ms
> >>> >
> >>> >
> >>> > It seems someone else has also encountered this problem:
> >>> >
> >>> >
> >>>
> http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201305.mbox/%3C201305161047069952830@nfs.iscas.ac.cn%3E
> >>> >
> >>>
> >>
> >>
> >
>
