spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Running Spark in local mode
Date Sun, 19 Jun 2016 18:30:45 GMT
Thanks Jonathan for your points

I am aware of the fact that yarn-client and yarn-cluster are both deprecated
(they still work in 1.6.1), hence the new nomenclature.
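For reference, spark-submit simply translates the deprecated master URLs
into a master of yarn plus a deploy-mode, so the old and new forms are
equivalent (a sketch; the class and JAR names below are placeholders):

```shell
# Deprecated form (still accepted in 1.6.x):
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar

# Equivalent current form:
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar
```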

Bear in mind this is what I stated in my notes:

"YARN Cluster Mode: the Spark driver runs inside an application master
process which is managed by YARN on the cluster, and the client can go away
after initiating the application. This is invoked with --master yarn and
--deploy-mode cluster.

YARN Client Mode: the driver runs in the client process, and the
application master is only used for requesting resources from YARN.

Unlike Spark standalone mode, in which the master’s address is specified in
the --master parameter, in YARN mode the ResourceManager’s address is
picked up from the Hadoop configuration. Thus, the --master parameter is
yarn. This is invoked with --deploy-mode client."

These are taken directly from the Spark documentation
<http://spark.apache.org/docs/latest/running-on-yarn.html>, and I quote:

"There are two deploy modes that can be used to launch Spark applications
on YARN. In cluster mode, the Spark driver runs inside an application
master process which is managed by YARN on the cluster, and the client can
go away after initiating the application.

In client mode, the driver runs in the client process, and the application
master is only used for requesting resources from YARN.

Unlike Spark standalone
<http://spark.apache.org/docs/latest/spark-standalone.html> and Mesos
<http://spark.apache.org/docs/latest/running-on-mesos.html> modes, in which
the master’s address is specified in the --master parameter, in YARN mode
the ResourceManager’s address is picked up from the Hadoop configuration.
Thus, the --master parameter is yarn."
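To make the two modes concrete, the corresponding spark-submit invocations
look like this (a sketch; the class and JAR names are placeholders, and the
ResourceManager address is read from the Hadoop configuration rather than
passed on the command line):

```shell
# YARN cluster mode: the driver runs inside the YARN application master,
# so the client can disconnect after submitting.
spark-submit --master yarn --deploy-mode cluster \
    --class com.example.MyApp myapp.jar

# YARN client mode: the driver runs in this spark-submit process; the
# application master only requests resources from YARN.
spark-submit --master yarn --deploy-mode client \
    --class com.example.MyApp myapp.jar
```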

Cheers

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 19 June 2016 at 19:09, Jonathan Kelly <jonathakamzn@gmail.com> wrote:

> Mich, what Jacek is saying is not that you implied that YARN relies on two
> masters. He's just clarifying that yarn-client and yarn-cluster modes are
> really both using the same (type of) master (simply "yarn"). In fact, if
> you specify "--master yarn-client" or "--master yarn-cluster", spark-submit
> will translate that into using a master URL of "yarn" and a deploy-mode of
> "client" or "cluster".
>
> And thanks, Jacek, for the tips on the "less-common master URLs". I had no
> idea that was an option!
>
> ~ Jonathan
>
> On Sun, Jun 19, 2016 at 4:13 AM Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
>> Good points, but I am an experimentalist.
>>
>> In Local mode I have this:
>>
>> --master local
>>
>> This will start with one thread, equivalent to --master local[1]. You
>> can also start with more than one thread by specifying the number of
>> threads *k* in --master local[k], or use all available threads with
>> --master local[*], which in my case would be local[12].
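The thread-count variants above can be sketched as follows (any of these
master URLs also works with spark-submit; quoting keeps the shell from
treating [*] as a glob):

```shell
# One worker thread (same as plain --master local):
spark-shell --master "local[1]"

# Four worker threads:
spark-shell --master "local[4]"

# One worker thread per logical core on the machine:
spark-shell --master "local[*]"
```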
>>
>> The important thing about Local mode is that the number of JVMs spawned
>> is controlled by you, and you can start as many spark-submit jobs as you
>> wish within the constraints of your resources:
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>                 --packages com.databricks:spark-csv_2.11:1.3.0 \
>>                 --driver-memory 2G \
>>                 --num-executors 1 \
>>                 --executor-memory 2G \
>>                 --master local \
>>                 --executor-cores 2 \
>>                 --conf "spark.scheduler.mode=FIFO" \
>>                 --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>                 --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>>                 --class "${FILE_NAME}" \
>>                 --conf "spark.ui.port=4040" \
>>                 ${JAR_FILE} \
>>                 >> ${LOG_FILE}
>>
>> Now that does work fine, although some of those parameters are implicit
>> (for example spark.scheduler.mode defaults to FIFO, with FAIR as the
>> alternative), and I can start different Spark jobs in Local mode. Great
>> for testing.
>>
>> With regard to your comments on Standalone
>>
>> Spark Standalone – a simple cluster manager included with Spark that
>> makes it easy to set up a cluster.
>>
>> s/simple/built-in
>> What is stated as "included" already implies that, i.e. it comes as part
>> of Spark and is used when running in standalone mode.
>>
>> Your other points on YARN cluster mode and YARN client mode
>>
>> I'd say there's only one YARN master, i.e. --master yarn. You could
>> however say where the driver runs, be it on your local machine where
>> you executed spark-submit or on one node in a YARN cluster.
>>
>>
>> Yes, that is, I believe, what the text implied. I would be very
>> surprised if YARN as a resource manager relied on two masters :)
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 June 2016 at 11:46, Jacek Laskowski <jacek@japila.pl> wrote:
>>
>>> On Sun, Jun 19, 2016 at 12:30 PM, Mich Talebzadeh
>>> <mich.talebzadeh@gmail.com> wrote:
>>>
>>> > Spark Local - Spark runs on the local host. This is the simplest set
>>> up and
>>> > best suited for learners who want to understand different concepts of
>>> Spark
>>> > and those performing unit testing.
>>>
>>> There are also the less-common master URLs:
>>>
>>> * local[n, maxRetries] or local[*, maxRetries] — local mode with n
>>> threads and up to maxRetries task failures.
>>> * local-cluster[n, cores, memory] for simulating a Spark local cluster
>>> with n workers, the given number of cores per worker, and memory per
>>> worker in MB.
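Those less-common master URLs can be tried directly from the shell (a
sketch; the bracketed numbers are example values, with local-cluster
memory given in MB per worker):

```shell
# Local mode with 4 threads, allowing up to 3 task failures:
spark-shell --master "local[4,3]"

# Simulated cluster: 2 workers, 1 core and 1024 MB of memory each:
spark-shell --master "local-cluster[2,1,1024]"
```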
>>>
>>> As of Spark 2.0.0, you could also have your own scheduling system -
>>> see https://issues.apache.org/jira/browse/SPARK-13904 - with the only
>>> known implementation of the ExternalClusterManager contract in Spark
>>> being YarnClusterManager, i.e. whenever you call Spark with --master
>>> yarn.
>>>
>>> > Spark Standalone – a simple cluster manager included with Spark that
>>> makes
>>> > it easy to set up a cluster.
>>>
>>> s/simple/built-in
>>>
>>> > YARN Cluster Mode, the Spark driver runs inside an application master
>>> > process which is managed by YARN on the cluster, and the client can go
>>> away
>>> > after initiating the application. This is invoked with –master yarn and
>>> > --deploy-mode cluster
>>> >
>>> > YARN Client Mode, the driver runs in the client process, and the
>>> application
>>> > master is only used for requesting resources from YARN. Unlike Spark
>>> > standalone mode, in which the master’s address is specified in the
>>> --master
>>> > parameter, in YARN mode the ResourceManager’s address is picked up
>>> from the
>>> > Hadoop configuration. Thus, the --master parameter is yarn. This is
>>> invoked
>>> > with --deploy-mode client
>>>
>>> I'd say there's only one YARN master, i.e. --master yarn. You could
>>> however say where the driver runs, be it on your local machine where
>>> you executed spark-submit or on one node in a YARN cluster.
>>>
>>> The same applies to Spark Standalone and Mesos and is controlled by
>>> --deploy-mode, i.e. client (default) or cluster.
>>>
>>> Please update your notes accordingly ;-)
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> ----
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>
>>
