The goal of the project is to develop an algorithm that automatically scales the cluster up and down based on the volume of data processed by the application.

By "scale the cluster up and down" do you mean:

1) Adding/removing Spark executors based on the load? How is that different from the dynamic resource allocation that Spark already supports?

2) Or do you mean scaling the number of servers in the cluster (e.g. AWS autoscaling)? If so, I'm afraid that is outside the scope of Spark itself.

Regards,
Shuai

On Fri, Feb 3, 2017 at 3:05 AM, Gabi Cristache <gabi.cristache@gmail.com> wrote:
Hello,

My name is Gabriel Cristache and I am a final-year student at a Computer Engineering/Science university. For my Bachelor's thesis, I want to add support for dynamic scaling to a Spark Streaming application.


The goal of the project is to develop an algorithm that automatically scales the cluster up and down based on the volume of data processed by the application.

You will need to balance between quick reaction to traffic spikes (scale up) and avoiding wasted resources (scale down) by implementing something along the lines of a PID algorithm.
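
To make the PID idea concrete, here is a minimal Python sketch of what such a controller could look like. Everything in it is an assumption for illustration: the class name `PidScaler`, the gains, the executor limits, and the choice of error signal (how far the batch processing time is from the batch interval) are hypothetical and are not part of any Spark API.

```python
from dataclasses import dataclass


@dataclass
class PidScaler:
    """Hypothetical PID controller that suggests an executor count."""
    kp: float = 0.5          # proportional gain (arbitrary, needs tuning)
    ki: float = 0.1          # integral gain (arbitrary, needs tuning)
    kd: float = 0.05         # derivative gain (arbitrary, needs tuning)
    min_executors: int = 1   # assumed lower bound
    max_executors: int = 20  # assumed upper bound
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, current, processing_time, batch_interval, dt=1.0):
        """Return a suggested executor count for the next interval."""
        # Positive error: batches take longer than the interval, so the
        # application is falling behind and should scale up.
        error = (processing_time - batch_interval) / batch_interval
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error

        # Relative change in capacity suggested by the three PID terms.
        delta = self.kp * error + self.ki * self.integral + self.kd * derivative
        target = round(current * (1.0 + delta))

        # Clamp so a spike cannot request an unbounded number of executors.
        return max(self.min_executors, min(self.max_executors, target))


if __name__ == "__main__":
    scaler = PidScaler()
    # 4 executors, batches take 20 s against a 10 s interval.
    print(scaler.update(current=4, processing_time=20.0, batch_interval=10.0))
```

The proportional term gives the quick reaction to spikes, while the integral and derivative terms damp oscillation so that capacity is not released and requested again on every batch. In practice the gains would have to be tuned against real traffic.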



Do you think this is feasible? If so, do you have any hints that would help me reach this objective?


Thanks,

Gabriel Cristache