spark-issues mailing list archives

From "Mridul Muralidharan (JIRA)" <>
Subject [jira] [Commented] (SPARK-1453) Improve the way Spark on Yarn waits for executors before starting
Date Fri, 11 Apr 2014 22:43:18 GMT


Mridul Muralidharan commented on SPARK-1453:

The timeout gets hit only when we don't get the requested executors, right? So it is more like
a max timeout (controlled by the number of times we loop, iirc).
The reason for keeping it stupid was simply that we have no guarantee of the number of containers
that might be available to Spark in a busy cluster: at times, it might not be practically
possible to get even a fraction of the requested nodes (either due to a busy cluster or because
of a lack of resources), so waiting for all of them could mean an infinite wait.

Ideally, I should have exposed the number of containers allocated, so that at least user code
could use it as an SPI and decide how to proceed for more complex cases. Missed out on this one.

I am not sure which use cases make sense.
a) Wait for X seconds or until the requested containers are allocated.
b) Wait until a minimum of Y containers is allocated (out of X requested).
c) (b) with (a): that is, a minimum number of containers and a timeout on that.
d) (c), with exit if the minimum number of containers is not allocated?

(d) is something I keep running into (if I don't get my required minimum nodes and the job
proceeds, I usually end up bringing down those nodes :-( )
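
To make the four options above concrete, here is a rough sketch in Python (not Spark's actual
code or API). Everything in it is hypothetical: `wait_for_containers`, `InsufficientContainers`,
and the `allocated` callback, which stands in for "the number of containers allocated" that
would need to be exposed. One polling loop can cover all four cases depending on how it is called.

```python
import time
from typing import Callable, Optional


class InsufficientContainers(RuntimeError):
    """Hypothetical error for option (d): target not reached before the timeout."""


def wait_for_containers(
    allocated: Callable[[], int],       # hypothetical: returns containers allocated so far
    target: int,                        # requested count for (a), minimum Y for (b)-(d)
    timeout_s: Optional[float] = None,  # None means wait indefinitely, as in (b)
    fail_on_timeout: bool = False,      # True gives option (d)
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until `target` containers are allocated or the timeout expires.

    Returns the number of containers allocated when the wait ends.
    """
    start = clock()
    while True:
        n = allocated()
        if n >= target:
            return n
        if timeout_s is not None and clock() - start >= timeout_s:
            if fail_on_timeout:
                raise InsufficientContainers(
                    f"only {n} of {target} containers after {timeout_s}s"
                )
            # Options (a) and (c): give up waiting and proceed with what we have.
            return n
        sleep(poll_s)
```

With this shape, (a) is `target=requested, timeout_s=X`; (b) is `target=Y` with no timeout;
(c) is `target=Y, timeout_s=X`; and (d) is (c) with `fail_on_timeout=True`, so the job exits
instead of proceeding on too few nodes.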

> Improve the way Spark on Yarn waits for executors before starting
> -----------------------------------------------------------------
>                 Key: SPARK-1453
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.0.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
> Currently Spark on Yarn just delays a few seconds between when the spark context is initialized
> and when it allows the job to start.  If you are on a busy hadoop cluster it might take longer
> to get the number of executors.
> At the very least we could make this timeout a configurable value.  It's currently hardcoded
> to 3 seconds.
> Better yet would be to allow the user to give a minimum number of executors it wants to wait
> for, but that looks much more complex.

This message was sent by Atlassian JIRA
