spark-user mailing list archives

From twinkle sachdeva <twinkle.sachd...@gmail.com>
Subject Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs
Date Tue, 07 Apr 2015 06:24:01 GMT
Hi,

One of the rationales behind killing the app can be to avoid skewness in the
data.

I have created this issue (https://issues.apache.org/jira/browse/SPARK-6735)
to provide an option for disabling this behaviour, as well as for making the
number of executor failures relative to a window duration.
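
Roughly, the idea is to count executor failures against a sliding validity
window instead of over the whole lifetime of the application. A minimal
sketch of that (illustrative only, not the actual patch):

import scala.collection.mutable

// Illustrative sketch only; not the actual SPARK-6735 change.
// Executor failures are counted relative to a validity window rather than
// over the whole lifetime of the application.
class WindowedFailureTracker(maxFailures: Int, windowMs: Long) {
  private val failureTimes = mutable.Queue[Long]()

  def recordFailure(now: Long = System.currentTimeMillis()): Unit = {
    failureTimes.enqueue(now)
    // Forget failures that happened before the window started.
    while (failureTimes.nonEmpty && now - failureTimes.head > windowMs) {
      failureTimes.dequeue()
    }
  }

  // Fail the application only when the recent failure count crosses the limit.
  def shouldFailApp: Boolean = failureTimes.size > maxFailures
}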

I will upload the PR shortly.

Thanks,
Twinkle


On Tue, Apr 7, 2015 at 2:02 AM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> What's the advantage of killing an application for lack of resources?
>
> I think the rationale behind killing an app based on executor failures is
> that, if we see a lot of them in a short span of time, it means there's
> probably something going wrong in the app or on the cluster.
>
> On Wed, Apr 1, 2015 at 7:08 PM, twinkle sachdeva <
> twinkle.sachdeva@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks Sandy.
>>
>>
>> Another way to look at this: would we want our long running application to
>> die?
>>
>> So let's say we create a window of around 10 batches, and we are using
>> incremental kinds of operations inside our application, so a restart here is
>> relatively more costly. Should the criterion for failing the application be a
>> maximum number of executor failures, or should we have some parameter around
>> a minimum number of available executors for some time x?
>>
>> So, if the application is not able to keep a minimum of n executors within a
>> period of time x, then we should fail the application.
>>
>> Adding a time factor here will give Spark some window to get more executors
>> allocated if some of them fail.
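>>
>> A rough sketch of what I mean (purely illustrative; the helper and its
>> parameters are made up, not an existing API):
>>
>> // Sketch only: fail the app when it has stayed below the minimum
>> // executor count for longer than the allowed period x.
>> def shouldFailApp(aliveExecutors: Int,
>>                   minExecutors: Int,
>>                   belowMinSince: Option[Long], // when we first dropped below min
>>                   maxBelowMinMs: Long,
>>                   now: Long): Boolean =
>>   aliveExecutors < minExecutors &&
>>     belowMinSince.exists(start => now - start > maxBelowMinMs)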
>>
>> Thoughts please.
>>
>> Thanks,
>> Twinkle
>>
>>
>> On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza <sandy.ryza@cloudera.com>
>> wrote:
>>
>>> That's a good question, Twinkle.
>>>
>>> One solution could be to allow a maximum number of failures within any
>>> given time span.  E.g. a max failures per hour property.
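>>>
>>> Something like this, say (the property name is hypothetical, just to make
>>> the idea concrete; it is not an existing Spark setting):
>>>
>>> import org.apache.spark.SparkConf
>>>
>>> // Hypothetical knob: tolerate at most 8 executor failures per rolling hour.
>>> val conf = new SparkConf()
>>>   .set("spark.yarn.max.executor.failures.per.hour", "8")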
>>>
>>> -Sandy
>>>
>>> On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva <
>>> twinkle.sachdeva@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> In Spark over YARN, there is a property "spark.yarn.max.executor.failures"
>>>> which controls the maximum number of executor failures an application will
>>>> survive.
>>>>
>>>> If the number of executor failures (due to any reason, such as OOM or
>>>> machine failure) exceeds this value, then the application quits.
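>>>>
>>>> For example, an application that should tolerate up to 16 executor
>>>> failures over its lifetime can set it like this (the property is real;
>>>> the value is just an example):
>>>>
>>>> import org.apache.spark.SparkConf
>>>>
>>>> // The 17th executor failure will make the application quit.
>>>> val conf = new SparkConf()
>>>>   .set("spark.yarn.max.executor.failures", "16")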
>>>>
>>>> For a short-duration Spark job this looks fine, but for long running jobs,
>>>> since this does not take the duration into account, it can lead to the same
>>>> treatment for two different scenarios (mentioned below):
>>>> 1. executors failing within 5 mins.
>>>> 2. executors failing sparsely, but at some point even a single executor
>>>> failure (which the application could have survived) can make the
>>>> application quit.
>>>>
>>>> Sending this to the community to hear what kind of behaviour / strategy
>>>> they think would be suitable for long running Spark jobs or Spark
>>>> Streaming jobs.
>>>>
>>>> Thanks and Regards,
>>>> Twinkle
>>>>
>>>
>>>
>>
>
