spark-user mailing list archives

From Jean Georges Perrin <...@jgp.net>
Subject Re: spark.executor.cores
Date Fri, 15 Jul 2016 20:35:13 GMT
Hey Mich,


Oh well, you know, we humble programmers modestly try to understand what the brilliant data
scientists are designing, and I can assure you that it is not easy.

Basically, I see 3 ways to use Spark:

1) As a developer
I just embed the Spark binaries (jars) in my Maven POM. In the app, when I need to have Spark
do something, I just call the local master (quick example here: http://jgp.net/2016/06/26/your-very-first-apache-spark-application/).

Pro: this is the super-duper easy & lazy way, works like a charm, setup in under 5 minutes
with one arm behind your back and blindfolded.
Con: well, I have a MacBook Air, a nice MacBook Air, but still it is only a MacBook Air, with
8 GB of RAM and 2 cores... My analysis never finishes (but a subset does).

2) As a database
Ok, some will probably find that shocking, but I used Spark as a database on a distant computer
(my sweet Micha). The app connects to Spark, tells it what to do, and the application "consumes"
the data crunching done by Spark on Micha (a bit more on the architecture there: http://jgp.net/2016/07/14/chapel-hill-we-dont-have-a-problem/).

Pro: this can scale like crazy (I have benchmarks scheduled)
Con: well... after going through all the issues I had, I don't see many issues anymore
(except that I still can't set the # of executors -- which starts to make sense).
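
On the executor-count point: in standalone mode, `spark.executor.cores` is the number of cores per executor, while `spark.cores.max` caps the total an application may take, so the executor count roughly follows from those two rather than being set directly. Below is a minimal sketch of picking those properties per request. It is plain Java so it runs without Spark on the classpath. The class name `CoreSettings` and the method `forRequest` are hypothetical, the 22/2 split simply mirrors the `restId` test quoted further down, and this is an illustration rather than a confirmed fix for the problem in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: chooses Spark properties for one request.
class CoreSettings {

    // restId == 0 is treated as the "big" job, anything else as a small one,
    // following the if/else quoted later in this thread (an assumption).
    static Map<String, String> forRequest(int restId) {
        String cores = (restId == 0) ? "22" : "2";
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.executor.memory", "4g");
        // Cores used by each executor.
        props.put("spark.executor.cores", cores);
        // Upper bound on the cores the whole application may take (standalone mode).
        props.put("spark.cores.max", cores);
        return props;
    }

    public static void main(String[] args) {
        // Print the properties chosen for a small job.
        forRequest(1).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Each entry would then be applied with `conf.set(key, value)` on the `SparkConf` before the `JavaSparkContext` is created, leaving cores free for a second application.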

3) As a remote batch processor
You prepare your "batch" as a jar. I remember using mainframes this way (and using SAS). 

Pro: very friendly to data scientists / researchers as they are used to this batch model.
Con: you need to prepare the batch, send it... The jar also needs to do something with the results:
save them in a database? send a mail? send a PDF? call the police?
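
For the batch route, the submission can be parameterised along the lines Mich describes in the quoted message below. This is a rough sketch that only assembles a `spark-submit` argument list from runtime values and does not launch anything. The class `SparkSubmitCommand`, the Spark home `/opt/spark`, the main class `com.example.Batch` and the jar name `batch.jar` are all made up for illustration; the flags are the ones that appear in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the argument list for a spark-submit call.
class SparkSubmitCommand {

    static List<String> build(String sparkHome, String master, String driverMemory,
                              int executorCores, String mainClass, String jarPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkHome + "/bin/spark-submit");
        cmd.add("--master");
        cmd.add(master);
        // Driver memory has to be passed here, before the driver JVM starts.
        cmd.add("--driver-memory");
        cmd.add(driverMemory);
        cmd.add("--executor-cores");
        cmd.add(Integer.toString(executorCores));
        cmd.add("--class");
        cmd.add(mainClass);
        // The application jar comes last.
        cmd.add(jarPath);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("/opt/spark", "spark://10.0.100.120:7077",
                "2G", 2, "com.example.Batch", "batch.jar");
        // Show the command instead of running it.
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list could be handed to `new ProcessBuilder(cmd)` to actually launch the batch; what the jar then does with its results is still up to the jar.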

Do you agree? Any other opinion?

I am not saying one is better than the other, just trying to get a "big picture".

jg




> On Jul 15, 2016, at 2:13 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Interesting
> 
> For some stuff I create an uber jar file and use that against spark-submit. I have not attempted to start the cluster from within the application.
> 
> 
> I tend to use a shell program (actually a k-shell) to compile it via maven or sbt and then run it accordingly. In general you can parameterise everything for runtime, from --driver-memory ${DRIVER_MEMORY} to practically any other parameter. I find that more flexible because I can submit the jar file and the class in any environment and adjust those runtime parameters accordingly. There are certain advantages to using spark-submit. For example, since the driver-memory setting sizes the driver JVM itself, any non-default value has to be given to spark-submit before that JVM starts.
> 
> I would be keen on hearing the pros and cons of the above approach. I am sure you programmers (Scala/Java) know much more than me :)
> 
> Cheers
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
or destruction of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
>  
> 
> On 15 July 2016 at 16:42, Jean Georges Perrin <jgp@jgp.net> wrote:
> lol - young padawan I am and path to knowledge seeking I am...
> 
> And on this path I also tried (without luck)...
> 
> 		if (restId == 0) {
> 			conf = conf.setExecutorEnv("spark.executor.cores", "22");
> 		} else {
> 			conf = conf.setExecutorEnv("spark.executor.cores", "2");
> 		}
> 
> and
> 
> 		if (restId == 0) {
> 			conf.setExecutorEnv("spark.executor.cores", "22");
> 		} else {
> 			conf.setExecutorEnv("spark.executor.cores", "2");
> 		}
> 
> The only annoying thing I see is that we designed some of the work to be handled by the driver/client app, and we will have to rethink the design of the app a bit for that...
> 
> 
>> On Jul 15, 2016, at 11:34 AM, Daniel Darabos <daniel.darabos@lynxanalytics.com> wrote:
>> 
>> Mich's invocation is for starting a Spark application against an already running
Spark standalone cluster. It will not start the cluster for you.
>> 
>> We used to not use "spark-submit", but we started using it when it solved some problem
for us. Perhaps that day has also come for you? :)
>> 
>> On Fri, Jul 15, 2016 at 5:14 PM, Jean Georges Perrin <jgp@jgp.net> wrote:
>> I don't use submit: I start my standalone cluster and connect to it remotely. Is that a bad practice?
>> 
>> I'd like to be able to do it dynamically, as the system knows whether it needs more or less resources based on its own context.
>> 
>>> On Jul 15, 2016, at 10:55 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> You can also do all this at env or submit time with spark-submit, which I believe makes it more flexible than hard-coding it.
>>> 
>>> Example
>>> 
>>> ${SPARK_HOME}/bin/spark-submit \
>>>                 --packages com.databricks:spark-csv_2.11:1.3.0 \
>>>                 --driver-memory 2G \
>>>                 --num-executors 2 \
>>>                 --executor-cores 3 \
>>>                 --executor-memory 2G \
>>>                 --master spark://50.140.197.217:7077 \
>>>                 --conf "spark.scheduler.mode=FAIR" \
>>>                 --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>                 --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>>>                 --class "${FILE_NAME}" \
>>>                 --conf "spark.ui.port=${SP}" \
>>>  
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>> 
>>> On 15 July 2016 at 13:48, Jean Georges Perrin <jgp@jgp.net> wrote:
>>> Merci Nihed, this is one of the tests I did :( still not working
>>> 
>>> 
>>> 
>>>> On Jul 15, 2016, at 8:41 AM, nihed mbarek <nihedmm@gmail.com> wrote:
>>>> 
>>>> Can you try with:
>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app").set("spark.executor.memory", "4g")
>>>> 				.setMaster("spark://10.0.100.120:7077");
>>>> 		if (restId == 0) {
>>>> 			conf = conf.set("spark.executor.cores", "22");
>>>> 		} else {
>>>> 			conf = conf.set("spark.executor.cores", "2");
>>>> 		}
>>>> 		JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>> 
>>>> On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <jgp@jgp.net> wrote:
>>>> Hi,
>>>> 
>>>> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
>>>> 
>>>> My process uses all the cores of my server (good), but I am trying to limit it so I can actually submit a second job.
>>>> 
>>>> I tried
>>>> 
>>>> 		SparkConf conf = new SparkConf().setAppName("NC Eatery app").set("spark.executor.memory", "4g")
>>>> 				.setMaster("spark://10.0.100.120:7077");
>>>> 		if (restId == 0) {
>>>> 			conf = conf.set("spark.executor.cores", "22");
>>>> 		} else {
>>>> 			conf = conf.set("spark.executor.cores", "2");
>>>> 		}
>>>> 		JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>> 
>>>> and
>>>> 
>>>> 		SparkConf conf = new SparkConf().setAppName("NC Eatery app").set("spark.executor.memory", "4g")
>>>> 				.setMaster("spark://10.0.100.120:7077");
>>>> 		if (restId == 0) {
>>>> 			conf.set("spark.executor.cores", "22");
>>>> 		} else {
>>>> 			conf.set("spark.executor.cores", "2");
>>>> 		}
>>>> 		JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>> 
>>>> but it does not seem to take it. Any hint?
>>>> 
>>>> jg
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> M'BAREK Med Nihed,
>>>> Fedora Ambassador, TUNISIA, Northern Africa
>>>> http://www.nihed.com
>>>> 
>>>> http://tn.linkedin.com/in/nihed
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 

