spark-user mailing list archives

From Keith Simmons <>
Subject Re: Hung spark executors don't count toward worker memory limit
Date Fri, 10 Oct 2014 01:57:26 GMT
Actually, it looks like even when the job shuts down cleanly, there can be
a few minutes of overlap between the time the next job launches and the
time the first job actually terminates its process.  Here are some relevant
lines from my log:

14/10/09 20:49:20 INFO Worker: Asked to kill executor
14/10/09 20:49:20 INFO ExecutorRunner: Runner thread for executor app-20141009204127-0029/1 interrupted
14/10/09 20:49:20 INFO ExecutorRunner: Killing process!
14/10/09 20:49:20 INFO Worker: Asked to launch executor app-20141009204508-0030/1 for Job
... More lines about launching new job...
14/10/09 20:51:17 INFO Worker: Executor app-20141009204127-0029/1 finished with state KILLED

As you can see, the first app didn't actually shut down until about two
minutes after the new job launched.  During that time, I was at double the
worker memory limit.
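
For reference, the size of the overlap can be read straight off the log
timestamps.  The snippet below is only a small sketch of that arithmetic: it
assumes the timestamp occupies the first 17 characters of each line in the
yy/MM/dd HH:mm:ss layout shown above, and the helper name `parse_log_time`
is made up for illustration.

```python
from datetime import datetime

# Assumed layout of the timestamp prefix, e.g. "14/10/09 20:49:20".
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def parse_log_time(line):
    """Parse the 17-character timestamp prefix of a log line (hypothetical helper)."""
    return datetime.strptime(line[:17], LOG_TIME_FORMAT)


# The two lines from the excerpt that bracket the overlap window.
kill_line = "14/10/09 20:49:20 INFO ExecutorRunner: Killing process!"
finished_line = (
    "14/10/09 20:51:17 INFO Worker: Executor "
    "app-20141009204127-0029/1 finished with state KILLED"
)

kill_requested = parse_log_time(kill_line)
executor_finished = parse_log_time(finished_line)

# Time during which the old and the new executor were both alive.
overlap_seconds = (executor_finished - kill_requested).total_seconds()
print(overlap_seconds)  # 117.0
```

So the old executor lingered for 117 seconds after the kill was issued, while
the new executor had already been launched.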


On Thu, Oct 9, 2014 at 5:06 PM, Keith Simmons <> wrote:

> Hi Folks,
> We have a Spark job that occasionally runs out of memory and hangs
> (I believe in GC).  That is its own issue, which we're debugging, but in
> the meantime there's another unfortunate side effect.  When the job is
> killed (most often because of GC errors), each worker attempts to kill its
> respective executor.  However, it appears that several of the executors
> fail to shut themselves down (I actually have to kill -9 them).  Even
> though the worker fails to clean up the executor, it starts the next job
> as though all the resources have been freed.  This causes the Spark
> worker to exceed its configured memory limit, which in turn runs our
> boxes out of memory.  Is there a setting I can configure to prevent this
> issue?  Perhaps by having the worker force kill the executor, or not
> start the next job until it's confirmed the executor has exited?  Let me
> know if there's any additional information I can provide.
> Keith
> P.S. We're running Spark 1.0.2
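
To make the requested behaviour concrete, here is a rough sketch of "ask
politely, confirm the exit, and force kill if that fails".  It is not Spark's
Worker or ExecutorRunner code and says nothing about how Spark 1.0.2 actually
behaves; it only uses Python's standard subprocess module, and the names
`stop_then_force_kill` and `grace_seconds` are hypothetical.

```python
import subprocess


def stop_then_force_kill(proc, grace_seconds):
    """Stop a child process and return only once it has really exited.

    proc is a subprocess.Popen object.  The function first requests a normal
    shutdown, waits up to grace_seconds for the process to exit, and falls
    back to a forced kill (the equivalent of kill -9 on POSIX) if it has not.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_seconds)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced kill (SIGKILL on POSIX)
        proc.wait()  # block until the OS confirms the process is gone
        return "force-killed"
```

The point of the sketch is the ordering: the caller would launch the next
executor only after this function returns, so the memory of the old process
has been released before new memory is committed.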
