spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Ash <and...@andrewash.com>
Subject Re: Larger heap leads to perf degradation due to GC
Date Thu, 16 Oct 2014 18:46:26 GMT
I wonder if it's worth logging a warning on executor startup if the heap
size is >32GB?  Given that this has come up a few times it might be good to
embed this community knowledge into the product instead of email threads.

On Thu, Oct 16, 2014 at 7:47 AM, Akshat Aranya <aaranya@gmail.com> wrote:

> I just want to pitch in and say that I ran into the same problem with
> running with 64GB executors.  For example, some of the tasks take 5 minutes
> to execute, out of which 4 minutes are spent in GC.  I'll try out smaller
> executors.
>
> On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi,
>>
>> The other option to consider is using G1 GC, which should behave better
>> with large heaps.  But pointers are not compressed in heaps > 32 GB in
>> size, so you may be better off staying under 32 GB.
>>
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Mon, Oct 6, 2014 at 8:08 PM, Mingyu Kim <mkim@palantir.com> wrote:
>>
>>> Ok, cool. This seems to be general issues in JVM with very large heaps.
>>> I agree that the best workaround would be to keep the heap size below 32GB.
>>> Thanks guys!
>>>
>>> Mingyu
>>>
>>> From: Arun Ahuja <aahuja11@gmail.com>
>>> Date: Monday, October 6, 2014 at 7:50 AM
>>> To: Andrew Ash <andrew@andrewash.com>
>>> Cc: Mingyu Kim <mkim@palantir.com>, "user@spark.apache.org" <
>>> user@spark.apache.org>, Dennis Lawler <dlawler@palantir.com>
>>> Subject: Re: Larger heap leads to perf degradation due to GC
>>>
>>> We have used the strategy that you suggested, Andrew - using many
>>> workers per machine and keeping the heaps small (< 20gb).
>>>
>>> Using a large heap resulted in workers hanging or not responding
>>> (leading to timeouts).  The same dataset/job for us will fail (most often
>>> due to akka disassociated or fetch failures errors) with 10 cores / 100
>>> executors, 60 gb per executor while succceed with 1 core / 1000 executors /
>>> 6gb per executor.
>>>
>>> When the job does succceed with more cores per executor and larger heap
>>> it is usually much slower than the smaller executors (the same 8-10 min job
>>> taking 15-20 min to complete)
>>>
>>> The unfortunate downside of this has been, we have had some large
>>> broadcast variables which may not fit into memory (and unnecessarily
>>> duplicated) when using the smaller executors.
>>>
>>> Most of this is anecdotal but for the most part we have had more success
>>> and consistency with more executors with smaller memory requirements.
>>>
>>> On Sun, Oct 5, 2014 at 7:20 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>
>>>> Hi Mingyu,
>>>>
>>>> Maybe we should be limiting our heaps to 32GB max and running multiple
>>>> workers per machine to avoid large GC issues.
>>>>
>>>> For a 128GB memory, 32 core machine, this could look like:
>>>>
>>>> SPARK_WORKER_INSTANCES=4
>>>> SPARK_WORKER_MEMORY=32
>>>> SPARK_WORKER_CORES=8
>>>>
>>>> Are people running with large (32GB+) executor heaps in production?
>>>> I'd be curious to hear if so.
>>>>
>>>> Cheers!
>>>> Andrew
>>>>
>>>> On Thu, Oct 2, 2014 at 1:30 PM, Mingyu Kim <mkim@palantir.com> wrote:
>>>>
>>>>> This issue definitely needs more investigation, but I just wanted to
>>>>> quickly check if anyone has run into this problem or has general guidance
>>>>> around it. We’ve seen a performance degradation with a large heap on
a
>>>>> simple map task (I.e. No shuffle). We’ve seen the slowness starting
around
>>>>> from 50GB heap. (I.e. spark.executor.memoty=50g) And, when we checked
the
>>>>> CPU usage, there were just a lot of GCs going on.
>>>>>
>>>>> Has anyone seen a similar problem?
>>>>>
>>>>> Thanks,
>>>>> Mingyu
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message