spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: A question about Spark Cluster vs Local Mode
Date Thu, 28 Jul 2016 04:17:04 GMT
Hi

These are my notes on this topic.


   - *YARN Cluster Mode:* the Spark driver runs inside an application master
   process which is managed by YARN on the cluster, and the client can go away
   after initiating the application. This is invoked with --master yarn and
   --deploy-mode cluster.

   - *YARN Client Mode:* the driver runs in the client process, and the
   application master is used only for requesting resources from YARN. Unlike
   Spark standalone mode, where the master's address is specified in the
   --master parameter, in YARN mode the ResourceManager's address is picked
   up from the Hadoop configuration, so the --master parameter is simply
   yarn. This is invoked with --deploy-mode client.
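The two invocations above can be sketched as follows. This is only an illustration; the jar name, class name, and resource settings are placeholders, not from the original thread:

```shell
# Cluster mode: the driver runs inside the YARN application master,
# so the launching shell can exit once submission succeeds.
# com.example.MyApp and myapp.jar are hypothetical names.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  myapp.jar

# Client mode: the driver runs in this local process; the
# ResourceManager's address is picked up from the Hadoop configuration
# (HADOOP_CONF_DIR / YARN_CONF_DIR), not from --master.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  myapp.jar
```

Note that in both cases --master is just "yarn"; only --deploy-mode changes where the driver runs.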



YARN Cluster and Client Mode Considerations



   - Client mode requires that the process which launched the application
   remain alive. This means the host it runs on has to stay up, and the job
   will not survive an ssh session dying, for example, unless you use nohup.

   - Client mode driver logs are printed to stderr by default. Granted, you
   can change that. In contrast, in cluster mode they are all collected by
   YARN without any user intervention.

   - If your edge node (from where the app is launched) is not part of the
   cluster (e.g., it lives in an outside network behind firewalls or with
   higher latency), you may run into issues.

   - In cluster mode, your driver's CPU and memory usage is accounted for in
   YARN. This matters if your edge node is part of the cluster (and could be
   running YARN containers), since in client mode your driver will
   potentially use a lot of CPU/memory.

   - In cluster mode YARN can restart your application without user
   intervention. This is useful for things that need to stay up (think of a
   long-running streaming job, for example).

   - If your client is not close to the cluster (e.g. your PC), then you
   definitely want to go cluster mode to improve performance.

   - If your client is close to the cluster (e.g. an edge node), then you
   could go either client or cluster. Note that by going client, more
   resources are going to be used on the edge node.
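As a concrete illustration of the first point above, a client-mode submission can be detached from the terminal with nohup so it survives the ssh session ending. The jar, class, and log file names here are placeholders:

```shell
# Keep a client-mode driver alive after the ssh session dies.
# com.example.MyApp, myapp.jar and driver.log are hypothetical names.
nohup spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  myapp.jar > driver.log 2>&1 &

# The driver's stdout/stderr now go to driver.log instead of the
# terminal, and the shell can be closed without killing the driver.
```

In cluster mode none of this is needed, since the driver does not live in the launching process at all.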


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 04:39, Yu Wei <yu2003w@hotmail.com> wrote:

> If cluster runs out of memory, it seems that the executor will be
> restarted by cluster manager.
>
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
> ------------------------------
> *From:* Ascot Moss <ascot.moss@gmail.com>
> *Sent:* Thursday, July 28, 2016 9:48:13 AM
> *To:* user @spark
> *Subject:* A question about Spark Cluster vs Local Mode
>
> Hi,
>
> If I submit the same job to spark in cluster mode, does it mean in cluster
> mode it will be run in cluster memory pool and it will fail if it runs out
> of cluster's memory?
>
> --driver-memory 64g \
>
> --executor-memory 16g \
>
> Regards
>
