spark-dev mailing list archives

From Steve Loughran <>
Subject Re: Apache Spark Contribution
Date Fri, 03 Feb 2017 20:49:02 GMT
You might want to look at "Nephele: Efficient Parallel Data Processing in the Cloud" (Warneke
& Kao, 2009).

This was some of the work done in the research project which gave birth to Flink, though this
bit didn't surface in the final system, as they chose to leave VM allocation to others.

Essentially: the query planner could track allocations and the lifespans of work, so that if
a VM were to be released, it could pick the one closest to the end of its billing hour, let you
choose between fast but expensive vs slow but (maybe) less expensive, etc.
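To make the "release the VM closest to the end of its billing hour" idea concrete, here is a minimal sketch; the `Vm` class, field names, and the hourly billing period are all assumptions for illustration, not any real Spark or cloud API.

```python
from dataclasses import dataclass

BILLING_PERIOD_S = 3600  # assumed EC2-style hourly billing


@dataclass
class Vm:
    name: str
    uptime_s: int   # seconds since the instance was started
    busy: bool      # still has work scheduled on it


def pick_vm_to_release(vms):
    """Among idle VMs, pick the one closest to the end of its current
    billing hour: releasing it wastes the least already-paid-for time."""
    idle = [vm for vm in vms if not vm.busy]
    if not idle:
        return None

    def remaining_in_hour(vm):
        return BILLING_PERIOD_S - (vm.uptime_s % BILLING_PERIOD_S)

    return min(idle, key=remaining_in_hour)


fleet = [
    Vm("a", uptime_s=3500, busy=False),  # 100 s left in its hour
    Vm("b", uptime_s=600, busy=False),   # 3000 s left
    Vm("c", uptime_s=3590, busy=True),   # busy, not a candidate
]
print(pick_vm_to_release(fleet).name)  # -> a
```

The same shape extends to the fast-vs-cheap trade-off by scoring candidates on cost as well as remaining paid-for time.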

It's a complex problem: to do it you need to think about more than just spot load, and more
about "how to efficiently divide work amongst a pool of machines with different lifespans".

What could be good to look at today, rather than hard-coding the logic, would be:

-provide metrics which higher-level tools could use to make decisions/send hints
-maybe schedule things to best support pre-emptible nodes in the cluster: the ones you bid
spot prices for from EC2, which get one hour guaranteed and after that can be killed without
warning

Preemption-aware scheduling might imply making sure that any critical information is kept off
the preemptible nodes, or at least replicated onto a long-lived one, and having something in
the controller ready to react to unannounced pre-emption. FWIW, when YARN preempts you do get
notified, and maybe even some very early warning; I don't know if Spark uses that.
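A sketch of the "controller ready to react" part, assuming a hypothetical callback fired when the resource manager warns of preemption; none of these class or method names come from Spark or YARN.

```python
class Controller:
    """Toy scheduler state: react to a preemption warning by resubmitting
    the node's tasks and refusing to schedule anything new onto it."""

    def __init__(self):
        self.pending = []     # tasks awaiting a host
        self.running = {}     # node -> list of task ids running there
        self.draining = set()  # nodes we must no longer use

    def on_preemption_warning(self, node):
        """Hypothetical hook, called when the resource manager warns
        that `node` is about to be preempted."""
        self.draining.add(node)
        # put everything that was running there back on the queue
        self.pending.extend(self.running.pop(node, []))

    def schedulable_nodes(self, nodes):
        return [n for n in nodes if n not in self.draining]


c = Controller()
c.running = {"spot-1": ["task-7", "task-8"], "ondemand-1": ["task-9"]}
c.on_preemption_warning("spot-1")
print(c.pending)                                       # ['task-7', 'task-8']
print(c.schedulable_nodes(["spot-1", "ondemand-1"]))   # ['ondemand-1']
```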

There is some support in HDFS for declaring that some nodes have interdependent failures,
"failure domains", so you could use that to have HDFS handle replication and only store 1
copy on preemptible VMs, leaving only the scheduling and recovery problem.
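In the spirit of those failure domains, a placement routine can guarantee at least one replica lands on a long-lived node; this is a standalone sketch of the policy, not HDFS's actual BlockPlacementPolicy API.

```python
def place_replicas(block, nodes, n_replicas=3):
    """Pick hosts for `block` so that at least one replica lives on a
    long-lived (non-preemptible) node, with the rest free to sit on
    cheap preemptible VMs. `nodes` is a list of (name, preemptible)."""
    durable = [n for n, preemptible in nodes if not preemptible]
    spot = [n for n, preemptible in nodes if preemptible]
    if not durable:
        raise ValueError("need at least one long-lived node for a durable copy")
    placement = [durable[0]]  # the guaranteed durable copy
    for n in spot + durable[1:]:
        if len(placement) == n_replicas:
            break
        placement.append(n)
    return placement


nodes = [("d1", False), ("s1", True), ("s2", True), ("d2", False)]
print(place_replicas("blk-42", nodes))  # -> ['d1', 's1', 's2']
```

With this shape, losing every preemptible VM at once still leaves one live copy, so only scheduling and recovery remain.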

Finally, YARN container resizing lets you ask for more resources when busy and release them
when idle. This may be good for CPU load, though memory isn't something programs can easily
hand back once allocated.
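The decision side of such resizing could be as simple as a threshold policy with a dead band to avoid flapping; this is a toy heuristic, not a YARN API, and the thresholds are made up.

```python
def desired_cores(current_cores, cpu_util, min_cores=1, max_cores=16):
    """Toy resize policy: grow the container when busy, shrink when
    idle, and leave it alone in the dead band in between."""
    if cpu_util > 0.8:                            # saturated: ask for more
        return min(current_cores * 2, max_cores)
    if cpu_util < 0.2:                            # idle: give some back
        return max(current_cores // 2, min_cores)
    return current_cores                          # dead band: no change


print(desired_cores(4, 0.9))  # -> 8
print(desired_cores(4, 0.1))  # -> 2
print(desired_cores(4, 0.5))  # -> 4
```

Note the policy only touches cores; per the caveat above, shrinking a memory allocation is much harder because the process may still hold live data in it.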

On 2 Feb 2017, at 19:05, Gabi Cristache <<>> wrote:


My name is Gabriel Cristache and I am a final-year student in Computer Engineering/Science.
For my Bachelor's Thesis I want to add support for dynamic scaling to a Spark Streaming
application.

The goal of the project is to develop an algorithm that automatically scales the cluster up
and down based on the volume of data processed by the application.
You will need to balance quick reaction to traffic spikes (scale up) against avoiding
wasted resources (scale down), by implementing something along the lines of a PID algorithm.
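The PID idea sketched above could look like the following; the gains, the target scheduling delay, and the way the output maps to an executor count are all illustrative assumptions, not anything Spark ships.

```python
class Pid:
    """Minimal discrete PID controller."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measured, dt=1.0):
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Drive executor count from scheduling delay: delay above the target
# yields a negative output, so we negate it to get a scale-up signal.
pid = Pid(kp=2.0, ki=0.5, kd=0.0, setpoint=1.0)  # target: 1 s delay
executors = 10
for delay in [3.0, 2.0, 1.2]:       # observed scheduling delay, seconds
    adjustment = -pid.step(delay)
    executors = max(1, executors + round(adjustment))
print(executors)  # -> 21: it scales up while delay stays above target
```

The integral term is what keeps pushing while the backlog persists, and the derivative term (zero here) would damp the reaction once the delay starts falling; tuning those gains against real traffic is most of the thesis work.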

Do you think this is feasible? And if so, are there any hints you could give me that
would help with my objective?

Gabriel Cristache
