spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: spark.executor.cores
Date Fri, 15 Jul 2016 21:18:38 GMT
Great stuff thanks Jean. These are from my notes:

These are the Spark operation modes that I know


   - Spark Local: Spark runs on the local host. This is the simplest setup
     and best suited to learners who want to understand the different
     concepts of Spark and to those performing unit testing.

   - Spark Standalone: a simple cluster manager included with Spark that
     makes it easy to set up a cluster.

   - YARN Cluster Mode: the Spark driver runs inside an application master
     process which is managed by YARN on the cluster, and the client can go
     away after initiating the application. This is invoked with
     --master yarn and --deploy-mode cluster.

   - YARN Client Mode: the driver runs in the client process, and the
     application master is only used for requesting resources from YARN.
     Unlike Spark standalone mode, in which the master's address is
     specified in the --master parameter, in YARN mode the ResourceManager's
     address is picked up from the Hadoop configuration. Thus, the --master
     parameter is yarn. This is invoked with --deploy-mode client.

Spark Local is the easiest one. You do not need to have any master or
worker running. In this mode the driver program (SparkSubmit), the resource
manager and the executor all exist within the same JVM, which itself acts as
the worker and runs tasks in threads. This is the one I gather you use on
your favourite laptop.

You start it with --master local. This starts with one (worker) *thread*,
equivalent to --master local[1]. You can start with more than one thread by
specifying the number of threads *k* in --master local[k]. You can also
use as many threads as the machine has logical cores with --master local[*].
The degree of parallelism is defined by the number of threads *k*.
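
As a rough illustration of how the master string maps to a thread count,
here is a small shell sketch. The function name local_threads is my own
invention and not part of Spark; it only mirrors the rules above (local is
one thread, local[k] is k threads, local[*] is one thread per logical core)
and assumes getconf is available to count cores.

```shell
# Hypothetical helper, not part of Spark: print the number of worker
# threads implied by a local master string.
local_threads() {
  master="$1"
  case "$master" in
    local)
      # Plain "local" means a single worker thread.
      echo 1 ;;
    'local[*]')
      # One thread per logical core reported by the OS.
      getconf _NPROCESSORS_ONLN ;;
    local\[*\])
      # Strip the "local[" prefix and the "]" suffix to get k.
      k="${master#local\[}"
      k="${k%\]}"
      case "$k" in
        ''|*[!0-9]*)
          echo "not a valid local master: $master" >&2
          return 1 ;;
        *)
          echo "$k" ;;
      esac ;;
    *)
      echo "not a local master: $master" >&2
      return 1 ;;
  esac
}
```

For example, local_threads 'local[4]' prints 4. It is worth quoting the
master string on the command line as well, as in --master "local[4]", so
that the shell does not try to expand the square brackets as a file pattern.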

In *Local mode*, you do not need to start a master or slaves/workers. This
mode is pretty simple and you can run as many JVMs (spark-submit) as your
resources allow (resources meaning memory and cores). Additionally, the web
UI of the first app starts by default on port 4040, the next one on 4041
and so forth, unless you explicitly set it with
--conf "spark.ui.port=nnnnn".

Remember this is all about testing your apps. It is NOT a performance test.
What it allows you to do is test multiple apps concurrently and, more
importantly, get started and understand the various configuration
parameters that Spark uses together with the spark-submit executable.


You can of course use the spark-shell and spark-sql utilities. These in turn
rely on the spark-submit executable to run certain variations of the JVM. In
other words, you are still executing spark-submit. You can pass parameters
to spark-submit as in the example below:

${SPARK_HOME}/bin/spark-submit \
                --packages <PACKAGE> \
                --driver-memory 2G \
                --num-executors 1 \
                --executor-memory 2G \
                --master local \
                --executor-cores 2 \
                --conf "spark.scheduler.mode=FAIR" \
                --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
                --jars <JARS> \
                --class "${FILE_NAME}" \
                --conf "spark.ui.port=4040" \
                --conf "spark.driver.port=54631" \
                --conf "spark.fileserver.port=54731" \
                --conf "spark.blockManager.port=54832" \
                --conf "spark.kryoserializer.buffer.max=512" \
                ${JAR_FILE} \
                >> ${LOG_FILE}


Note that in the above example I am only using modest resources. This is
intentional to ensure that resources are available for the other Spark jobs
that I may be testing on this standalone node.
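
If you test several apps on the same machine, it can help to keep these
values in environment variables rather than hard-coding them. A minimal
sketch, assuming variable names of my own choosing (DRIVER_MEMORY,
EXECUTOR_MEMORY, MASTER, UI_PORT) and using the modest values from the
example above as defaults:

```shell
# Sketch: print spark-submit options, one token per line, taking each value
# from an environment variable if it is set and from a modest default
# otherwise. The variable names are illustrative, not Spark conventions.
submit_opts() {
  printf '%s\n' \
    --driver-memory "${DRIVER_MEMORY:-2G}" \
    --executor-memory "${EXECUTOR_MEMORY:-2G}" \
    --master "${MASTER:-local}" \
    --conf "spark.ui.port=${UI_PORT:-4040}"
}
```

You could then run ${SPARK_HOME}/bin/spark-submit $(submit_opts) followed by
the --class and jar arguments, setting for instance DRIVER_MEMORY=4G
beforehand to override a default. This relies on the shell splitting the
output into words, so the values must not contain spaces.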

Alternatively, you can specify some of these parameters when you are
creating a new SparkConf:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf().
             setAppName("My appname").
             setMaster("local").
             set("spark.executor.instances", "1").  // property behind --num-executors
             set("spark.executor.memory", "2G").
             set("spark.executor.cores", "2").
             set("spark.cores.max", "2").
             set("spark.driver.allowMultipleContexts", "true").
             set("spark.hadoop.validateOutputSpecs", "false")

You can run practically most of your unit testing in Local mode and
exercise a variety of options, including running SQL queries, reading data
from CSV files, writing to HDFS, creating Hive tables (including ORC tables)
and doing Spark Streaming.

I like this mode as I can load my machine with as many Spark apps as I can,
and I am the only one managing the resources. Too many apps and they
will not run.


Cheers


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 July 2016 at 21:35, Jean Georges Perrin <jgp@jgp.net> wrote:

> Hey Mich,
>
>
> Oh well, you know, us humble programmers try to modestly understand what
> the brilliant data scientists are designing and, I can assure you that it
> is not easy.
>
> Basically I use Spark in 3 ways:
>
> 1) As a developer
> I just embed the Spark binaries (jars) in my Maven POM. In the app, when I
> need to have Spark do something, I just call the local's master (quick
> example here:
> http://jgp.net/2016/06/26/your-very-first-apache-spark-application/).
>
> Pro: this is the super-duper easy & lazy way, works like a charm, set up in
> under 5 minutes with one arm behind your back while blindfolded.
> Con: well, I have a MacBook Air, a nice MacBook Air, but still it is only
> a MacBook Air, with 8GB of RAM and 2 cores... My analysis never finished
> (but a subset does).
>
> 2) As a database
> Ok, some will probably find that shocking, but I used Spark as a database
> on a distant computer (my sweet Micha). The app connects to Spark, tells
> it what to do, and the application "consumes" the data crunching done by
> Spark on Micha (a bit more of the architecture there:
> http://jgp.net/2016/07/14/chapel-hill-we-dont-have-a-problem/).
>
> Pro: this can scale like crazy (I have benchmarks scheduled)
> Con: well... after you went through all the issues I had, I don't see many
> issues anymore (except that I still can't set the # of executors -- which
> starts to make sense).
>
> 3) As a remote batch processor
> You prepare your "batch" as a jar. I remember using mainframes this way
> (and using SAS).
>
> Pro: very friendly to data scientists / researchers as they are used to
> this batch model.
> Con: you need to prepare the batch, send it... The jar also needs to do
> something with the results: save them in a database? send a mail? send a
> PDF? call the police?
>
> Do you agree? Any other opinion?
>
> I am not saying one is better than the other, just trying to get a "big
> picture".
>
> jg
>
>
>
>
> On Jul 15, 2016, at 2:13 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Interesting
>
> For some stuff I create an uber jar file and use that with
> spark-submit. I have not attempted to start the cluster from within the
> application.
>
>
> I tend to use a shell program (actually a k-shell) to compile it via maven
> or sbt and then run it accordingly. In general you can parameterise
> everything for runtime, from --driver-memory ${DRIVER_MEMORY} to
> practically any other parameter. That way I find it more flexible
> because I can submit the jar file and the class in any environment and
> adjust those runtime parameters accordingly. There are certain advantages
> to using spark-submit. For example, since the driver-memory setting
> applies to the driver JVM itself, any non-default amount of driver memory
> has to be set before that JVM starts, by providing the value to
> spark-submit.
>
> I would be keen to hear the pros and cons of the above approach. I am
> sure you programmers (Scala/Java) know much more than me :)
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
> On 15 July 2016 at 16:42, Jean Georges Perrin <jgp@jgp.net> wrote:
>
>> lol - young padawan I am and path to knowledge seeking I am...
>>
>> And on this path I also tried (without luck)...
>>
>> if (restId == 0) {
>> conf = conf.setExecutorEnv("spark.executor.cores", "22");
>> } else {
>> conf = conf.setExecutorEnv("spark.executor.cores", "2");
>> }
>>
>> and
>>
>> if (restId == 0) {
>> conf.setExecutorEnv("spark.executor.cores", "22");
>> } else {
>> conf.setExecutorEnv("spark.executor.cores", "2");
>> }
>>
>> the only annoying thing I see is we designed some of the work to be
>> handled by the driver/client app and we will have to rethink a bit the
>> design of the app for that...
>>
>>
>> On Jul 15, 2016, at 11:34 AM, Daniel Darabos <
>> daniel.darabos@lynxanalytics.com> wrote:
>>
>> Mich's invocation is for starting a Spark application against an already
>> running Spark standalone cluster. It will not start the cluster for you.
>>
>> We used to not use "spark-submit", but we started using it when it solved
>> some problem for us. Perhaps that day has also come for you? :)
>>
>> On Fri, Jul 15, 2016 at 5:14 PM, Jean Georges Perrin <jgp@jgp.net> wrote:
>>
>>> I don't use submit: I start my standalone cluster and connect to it
>>> remotely. Is that a bad practice?
>>>
>>> I'd like to be able to do it dynamically as the system knows whether it
>>> needs more or less resources based on its own  context
>>>
>>> On Jul 15, 2016, at 10:55 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> You can also do all this at env or submit time with spark-submit, which I
>>> believe makes it more flexible than coding it in.
>>>
>>> Example
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>                 --packages com.databricks:spark-csv_2.11:1.3.0 \
>>>                 --driver-memory 2G \
>>>                 --num-executors 2 \
>>>                 --executor-cores 3 \
>>>                 --executor-memory 2G \
>>>                 --master spark://50.140.197.217:7077 \
>>>                 --conf "spark.scheduler.mode=FAIR" \
>>>                 --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>                 --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>>>                 --class "${FILE_NAME}" \
>>>                 --conf "spark.ui.port=${SP}" \
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>>
>>>
>>> On 15 July 2016 at 13:48, Jean Georges Perrin <jgp@jgp.net> wrote:
>>>
>>>> Merci Nihed, this is one of the tests I did :( still not working
>>>>
>>>>
>>>>
>>>> On Jul 15, 2016, at 8:41 AM, nihed mbarek <nihedmm@gmail.com> wrote:
>>>>
>>>> can you try with :
>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app").set(
>>>> "spark.executor.memory", "4g")
>>>> .setMaster("spark://10.0.100.120:7077");
>>>> if (restId == 0) {
>>>> conf = conf.set("spark.executor.cores", "22");
>>>> } else {
>>>> conf = conf.set("spark.executor.cores", "2");
>>>> }
>>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>>
>>>> On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <jgp@jgp.net>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
>>>>>
>>>>> My process uses all the cores of my server (good), but I am trying to
>>>>> limit it so I can actually submit a second job.
>>>>>
>>>>> I tried
>>>>>
>>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app").set(
>>>>> "spark.executor.memory", "4g")
>>>>> .setMaster("spark://10.0.100.120:7077");
>>>>> if (restId == 0) {
>>>>> conf = conf.set("spark.executor.cores", "22");
>>>>> } else {
>>>>> conf = conf.set("spark.executor.cores", "2");
>>>>> }
>>>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>>>
>>>>> and
>>>>>
>>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app").set(
>>>>> "spark.executor.memory", "4g")
>>>>> .setMaster("spark://10.0.100.120:7077");
>>>>> if (restId == 0) {
>>>>> conf.set("spark.executor.cores", "22");
>>>>> } else {
>>>>> conf.set("spark.executor.cores", "2");
>>>>> }
>>>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>>>
>>>>> but it does not seem to take it. Any hint?
>>>>>
>>>>> jg
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> M'BAREK Med Nihed,
>>>> Fedora Ambassador, TUNISIA, Northern Africa
>>>> http://www.nihed.com
>>>>
>>>> <http://tn.linkedin.com/in/nihed>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
