spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: bitten by spark.yarn.executor.memoryOverhead
Date Mon, 02 Mar 2015 16:45:54 GMT
bq. that 0.1 is "always" enough?

The answer is: it depends (on use cases).
The value of 0.1 has been validated by several users. I think it is a
reasonable default.
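
To make the numbers concrete, here is a minimal sketch of how the default overhead discussed in this thread works out for a few executor sizes. It assumes the rule described below: the larger of a fraction of executor memory and the 384 MB floor. The function and constant names are made up for illustration and are not Spark's API.

```python
# Sketch only: mirrors the rule described in this thread, not Spark's source.
# All names here are hypothetical.
MIN_OVERHEAD_MB = 384  # the absolute floor mentioned in the thread


def default_overhead_mb(executor_memory_mb: int, fraction: float) -> int:
    """Return the assumed default overhead in MB.

    Takes the larger of (fraction * executor memory) and the 384 MB floor.
    """
    return max(int(fraction * executor_memory_mb), MIN_OVERHEAD_MB)


if __name__ == "__main__":
    # Compare the old 7% fraction with the proposed 10% fraction.
    for mem_mb in (2048, 8192, 16384):
        old = default_overhead_mb(mem_mb, 0.07)
        new = default_overhead_mb(mem_mb, 0.10)
        print(f"executor={mem_mb} MB  overhead@7%={old} MB  overhead@10%={new} MB")
```

For small executors the 384 MB floor dominates either way, so the change from 0.07 to 0.1 only matters once executor memory is above roughly 5.5 GB. For comparison, a fixed 1024 MB override is 10% of a 10240 MB executor and 20% of a 5120 MB one.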

Cheers

On Mon, Mar 2, 2015 at 8:36 AM, Ryan Williams <ryan.blake.williams@gmail.com> wrote:

> For reference, the initial version of #3525
> <https://github.com/apache/spark/pull/3525> (still open) made this
> fraction a configurable value, but consensus went against that being
> desirable so I removed it and marked SPARK-4665
> <https://issues.apache.org/jira/browse/SPARK-4665> as "won't fix".
>
> My team wasted a lot of time on this failure mode as well and has settled
> into passing "--conf spark.yarn.executor.memoryOverhead=1024" to most
> jobs (that works out to 10-20% of --executor-memory, depending on the job).
>
> I agree that learning about this the hard way is a weak part of the
> Spark-on-YARN onboarding experience.
>
> The fact that our instinct here is to increase the 0.07 minimum instead of
> the alternate 384MB
> <https://github.com/apache/spark/blob/3efd8bb6cf139ce094ff631c7a9c1eb93fdcd566/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L93>
> minimum seems like evidence that the fraction is the thing we should allow
> people to configure, instead of the absolute amount that is currently
> configurable.
>
> Finally, do we feel confident that 0.1 is "always" enough?
>
>
> On Sat, Feb 28, 2015 at 4:51 PM Corey Nolet <cjnolet@gmail.com> wrote:
>
>> Thanks for taking this on Ted!
>>
>> On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>>> I have created SPARK-6085 with pull request:
>>> https://github.com/apache/spark/pull/4836
>>>
>>> Cheers
>>>
>>> On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet <cjnolet@gmail.com> wrote:
>>>
>>>> +1 to a better default as well.
>>>>
>>>> We were working fine until we ran against a real dataset which was much
>>>> larger than the test dataset we were using locally. It took me a couple
>>>> of days of digging through many logs to figure out that this value was
>>>> what was causing the problem.
>>>>
>>>> On Sat, Feb 28, 2015 at 11:38 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>
>>>>> Having a good out-of-box experience is desirable.
>>>>>
>>>>> +1 on increasing the default.
>>>>>
>>>>>
>>>>> On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>>>
>>>>>> There was a recent discussion about whether to increase or indeed
>>>>>> make configurable this kind of default fraction. I believe the
>>>>>> suggestion there too was that 9-10% is a safer default.
>>>>>>
>>>>>> Advanced users can lower the resulting overhead value; it may still
>>>>>> have to be increased in some cases, but a fatter default may make
>>>>>> this kind of surprise less frequent.
>>>>>>
>>>>>> I'd support increasing the default; any other thoughts?
>>>>>>
>>>>>> On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers <koert@tresata.com>
>>>>>> wrote:
>>>>>> > hey,
>>>>>> > running my first map-red like (meaning disk-to-disk, avoiding in memory
>>>>>> > RDDs) computation in spark on yarn i immediately got bitten by a too low
>>>>>> > spark.yarn.executor.memoryOverhead. however it took me about an hour to
>>>>>> > find out this was the cause. at first i observed failing shuffles leading
>>>>>> > to restarting of tasks, then i realized this was because executors could
>>>>>> > not be reached, then i noticed in resourcemanager logs that containers
>>>>>> > got shut down and reallocated (no mention of errors, it seemed the
>>>>>> > containers finished their business and shut down successfully), and
>>>>>> > finally i found the reason in nodemanager logs.
>>>>>> >
>>>>>> > i don't think this is a pleasant first experience. i realize
>>>>>> > spark.yarn.executor.memoryOverhead needs to be set differently from
>>>>>> > situation to situation. but shouldn't the default be a somewhat higher
>>>>>> > value so that these errors are unlikely, and then the experts that are
>>>>>> > willing to deal with these errors can tune it lower? so why not make the
>>>>>> > default 10% instead of 7%? that gives something that works in most
>>>>>> > situations out of the box (at the cost of being a little wasteful). it
>>>>>> > worked for me.
>>>>>>
