spark-dev mailing list archives

From Steve Loughran <ste...@cloudera.com.INVALID>
Subject Re: [DISCUSS] Enable blacklisting feature by default in 3.0
Date Wed, 03 Apr 2019 11:10:16 GMT
On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta <ankur.gupta@cloudera.com> wrote:

> Hi Steve,
>
> Thanks for your feedback. From your email, I could gather the following
> two important points:
>
>    1. Report failures to something (cluster manager) which can opt to
>    destroy the node and request a new one
>    2. Pluggable failure detection algorithms
>
> Regarding #1, the current blacklisting implementation does report blacklist
> status to YARN here
> <https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L126>,
> which can choose to take appropriate action based on failures across
> different applications (though it seems it doesn't currently). This doesn't
> work with static allocation, though, or for other cluster managers. Those
> issues are still open:
>
>    - https://issues.apache.org/jira/browse/SPARK-24016
>    - https://issues.apache.org/jira/browse/SPARK-19755
>    - https://issues.apache.org/jira/browse/SPARK-23485
>
> Regarding #2, that is a good point but I think that is optional and may
> not be tied to enabling the blacklisting feature in the current form.
>

I'd expect the algorithms to run in the controllers, as failures are
reported.

One other thing to consider is how to react when you are down to ~0 nodes.
At that point you may as well give up on the blacklisting because you've
just implicitly shut down the cluster. I seem to remember something (HDFS?)
trying to deal with that.
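
A minimal sketch of that kind of guard, in Python for brevity rather than
Spark's Scala. Every name here (NodeBlacklist, min_healthy_fraction,
try_blacklist) is hypothetical and not part of Spark's API. The only idea it
illustrates is refusing to blacklist a node when doing so would leave fewer
usable nodes than some floor:

```python
import math


class NodeBlacklist:
    """Hypothetical tracker for illustration; not Spark's BlacklistTracker."""

    def __init__(self, nodes, min_healthy_fraction=0.25):
        self.nodes = set(nodes)
        self.blacklisted = set()
        # Assumed policy: never let the usable pool drop below this many
        # nodes, and always keep at least one.
        self.min_healthy = max(
            1, math.ceil(len(self.nodes) * min_healthy_fraction)
        )

    def healthy(self):
        """Nodes that are still eligible for scheduling."""
        return self.nodes - self.blacklisted

    def try_blacklist(self, node):
        """Blacklist `node` unless that would starve the application.

        Returns True if the node was blacklisted, False otherwise.
        """
        if node not in self.nodes or node in self.blacklisted:
            return False
        if len(self.healthy()) - 1 < self.min_healthy:
            # Too few nodes would remain: give up on blacklisting instead
            # of implicitly shutting the cluster down.
            return False
        self.blacklisted.add(node)
        return True
```

With four nodes and min_healthy_fraction=0.5, the first two failing nodes
are blacklisted and the third request is refused, so two nodes always stay
schedulable.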


>
> Coming back to the concerns raised by Reynold, Chris and Steve, it seems
> that there are at least two tasks that we need to complete before we decide
> to enable blacklisting by default in its current form:
>
>    1. Avoid resource starvation because of blacklisting
>    2. Use exponential backoff for blacklisting instead of a configurable
>    threshold
>    3. Report blacklisting status to all cluster managers (I am not sure
>    if this is necessary to move forward though)
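
One possible reading of item 2, sketched in Python. The thread does not say
what exactly would be backed off, so this assumes it is the time a node
stays blacklisted, doubling on each repeated blacklisting up to a cap. The
function name and the base_s and cap_s defaults are hypothetical, not Spark
configuration values:

```python
def blacklist_timeout_s(times_blacklisted, base_s=60, cap_s=3600):
    """Seconds a node stays blacklisted after its Nth blacklisting.

    Assumed policy: base_s for the first offence, doubling each time,
    never exceeding cap_s.
    """
    if times_blacklisted < 1:
        raise ValueError("times_blacklisted must be >= 1")
    return min(cap_s, base_s * 2 ** (times_blacklisted - 1))
```

A node that fails once is retried quickly, while a persistently bad node
drifts toward the cap without anyone tuning a fixed threshold.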
>
> Thanks for all the feedback. Please let me know if there are other
> concerns that we would like to resolve before enabling blacklisting.
>
> Thanks,
> Ankur
>
>