spark-user mailing list archives

From: Alexander Pivovarov <apivova...@gmail.com>
Subject: Re: Spark on Yarn vs Standalone
Date: Wed, 09 Sep 2015 05:48:53 GMT

The problem we have now is skewed data (2360 tasks done in 5 min, 3
tasks in 40 min, and 1 task in 2 hours).
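
One common way to reduce this kind of skew, which nobody in this thread has proposed and which may not fit this job, is to "salt" a hot key so that its records spread over several partitions before a partial aggregation. Below is a minimal pure-Python sketch of the idea, without Spark. The names partition_of, salted_key, NUM_PARTITIONS and NUM_SALTS are hypothetical, CRC32 stands in for a real hash partitioner, and the data is synthetic.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # hypothetical partition count
NUM_SALTS = 16       # hypothetical number of salt values per key


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic stand-in for a hash partitioner (CRC32 modulo partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, record_index: int, num_salts: int = NUM_SALTS) -> str:
    """Append a round-robin salt so one logical key maps to several physical keys."""
    return f"{key}#{record_index % num_salts}"


def partition_loads(keys):
    """Count how many records land in each partition."""
    return Counter(partition_of(k) for k in keys)


# Synthetic data: one hot key with 10,000 records plus 100 keys with 10 records each.
records = ["hot"] * 10_000 + [f"k{i}" for i in range(100) for _ in range(10)]

plain = partition_loads(records)
salted = partition_loads(salted_key(k, i) for i, k in enumerate(records))

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

In a real job the salted partial results would have to be combined again per original key in a second step, so this only helps for operations that can be split that way (sums, counts and similar).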

Some people on the team worry that the executor running the longest
task could be killed by YARN (because the executor might become unresponsive
during GC, or it might use more memory than YARN allows).
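
For reference, the memory Spark requests from YARN for each executor container is roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead. Here is a small sketch of that arithmetic, using the two values quoted further down in this thread. The helper names parse_mib and container_request_mib are hypothetical, and the sketch ignores any rounding YARN applies to allocation sizes.

```python
def parse_mib(value: str) -> int:
    """Parse a memory setting such as '47924M' or '5324' into MiB.

    Assumption: a bare number is already in MiB, and only the 'M' and 'G'
    suffixes are handled here.
    """
    text = value.strip().upper()
    if text.endswith("G"):
        return int(text[:-1]) * 1024
    if text.endswith("M"):
        return int(text[:-1])
    return int(text)


def container_request_mib(executor_memory: str, memory_overhead: str) -> int:
    """Heap plus off-heap overhead: the per-executor size requested from YARN."""
    return parse_mib(executor_memory) + parse_mib(memory_overhead)


total = container_request_mib("47924M", "5324")
print(f"container request: {total} MiB = {total / 1024:.1f} GiB")
```

With the quoted settings this comes to 53248 MiB, which is exactly the 52 GiB assumed in the original question. An executor process that grows past its container size is the case where YARN could kill it, assuming its memory checks are enabled.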



On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> Those settings seem reasonable to me.
>
> Are you observing performance that's worse than you would expect?
>
> -Sandy
>
> On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov
> <apivovarov@gmail.com> wrote:
>
>> Hi Sandy
>>
>> Thank you for your reply.
>> Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
>> with the EMR setting for Spark "maximizeResourceAllocation": "true".
>>
>> It is automatically converted to these Spark settings:
>> spark.executor.memory            47924M
>> spark.yarn.executor.memoryOverhead 5324
>>
>> We also set spark.default.parallelism = slave_count * 16.
>>
>> Does this look good to you? (We run a single heavy job on the cluster.)
>>
>> Alex
>>
>> On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.ryza@cloudera.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> If they're both configured correctly, there's no reason that Spark
>>> Standalone should provide performance or memory improvement over Spark on
>>> YARN.
>>>
>>> -Sandy
>>>
>>> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov
>>> <apivovarov@gmail.com> wrote:
>>>
>>>> Hi Everyone
>>>>
>>>> We are trying the latest AWS emr-4.0.0 with Spark, and my question is
>>>> about YARN vs. Standalone mode.
>>>> Our use case is:
>>>> - start a 100-150 node cluster every week
>>>> - run one heavy Spark job (5-6 hours)
>>>> - save data to S3
>>>> - stop the cluster
>>>>
>>>> Officially, AWS emr-4.0.0 comes with Spark on YARN.
>>>> It is probably possible to hack EMR by creating a bootstrap script that
>>>> stops YARN and starts the master and slaves on each machine (to start Spark
>>>> in standalone mode).
>>>>
>>>> My questions are:
>>>> - Does Spark standalone provide a significant performance / memory
>>>> improvement compared to YARN mode?
>>>> - Is it worth hacking the official EMR Spark on YARN setup to switch Spark to
>>>> Standalone mode?
>>>>
>>>>
>>>> I have created a comparison table and would like you to check whether my
>>>> understanding is correct.
>>>>
>>>> Let's say an r3.2xlarge machine has 52 GB of RAM available for Spark
>>>> executor JVMs.
>>>>
>>>> Standalone vs. YARN comparison
>>>>
>>>>                                                                  | Standalone | YARN
>>>> Can an executor allocate up to 52 GB of RAM?                     | yes        | yes
>>>> Will an executor be unresponsive after using all 52 GB (GC)?     | yes        | yes
>>>> Additional JVMs on a slave besides the Spark executor            | worker     | node manager
>>>> Are the additional JVMs lightweight?                             | yes        | yes
>>>>
>>>>
>>>> Thank you
>>>>
>>>> Alex
>>>>
>>>
>>>
>>
>
