spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Some praise and comments on Spark
Date Wed, 25 Feb 2015 22:36:19 GMT
Thanks for the email and encouragement, Devl. Responses to the 3 requests:

-tonnes of configuration properties and "go faster" type flags. For example,
Hadoop and HBase users will know that there is a whole catalogue of
properties for regions, caches, network settings, block sizes, and so on.
Please don't end up here, for example:
https://hadoop.apache.org/docs/r1.0.4/mapred-default.html; it is painful
having to configure all of this, create a set of properties for each
environment, and then tie it all into CI and deployment tools.

As the project grows, introducing more config options is unavoidable; in
particular, we often use config options to gate new modules that are still
experimental before making them the default (e.g. sort-based shuffle).

The philosophy here is to set a very high bar for introducing new config
options, to make the default values sensible for most deployments, and,
whenever possible, to figure out the right setting automatically. That is
hard in general, but we expect that 99% of users will only ever need to
touch a very small number of options (e.g. setting the serializer).
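To make the "small number of options" point concrete, here is a minimal
sketch of a `spark-defaults.conf` fragment that does just the one thing
mentioned above, swapping in the Kryo serializer. `spark.serializer` is a
standard Spark configuration key; leaving everything else at its default is
the point:

```properties
# Minimal spark-defaults.conf sketch: override only the serializer and
# leave every other property at its (sensible) default.
spark.serializer    org.apache.spark.serializer.KryoSerializer
```

The same setting can also be passed per job with
`spark-submit --conf spark.serializer=...`, so per-environment property
sprawl stays small.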


-no more daemons and processes that have to be monitored, manipulated, and
restarted when they crash.

At the very least you'd need the cluster manager itself to be a daemon
process, because we can't defy the laws of physics. But I don't think we
want to introduce anything beyond that.


-a project that penalises developers (that will ultimately help promote
Spark to their managers and budget holders) with expensive training,
certification, books and accreditation. Ideally this open source should be
free, free training= more users = more commercial uptake.

I definitely agree with you on making it easier to learn Spark. We are
making a lot of material freely available, including two free MOOCs on
edX:
https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html



On Wed, Feb 25, 2015 at 2:13 PM, Devl Devel <devl.development@gmail.com>
wrote:

> Hi Spark Developers,
>
> First, apologies if this doesn't belong on this list, but the
> comments/praise are relevant to all developers. This is just a small note
> about what we really like about Spark; we don't mean to start a whole
> long discussion thread in this forum, just to share our positive
> experiences with Spark thus far.
>
> To start, as you can tell, we think that the Spark project is amazing and
> we love it! Having put nearly half a decade's worth of sweat and tears
> into production Hadoop and MapReduce clusters and application development,
> it's so refreshing to see something arguably simpler and more elegant
> supersede it.
>
> These are the things we love about Spark and hope these principles
> continue:
>
> -the one-command build, make-distribution.sh: simple, clean, and ideal for
> deployment, devops, and rebuilding on different environments and nodes.
> -not having too much runtime and deploy config; as admins and developers we
> are sick of setting props like io.sort and mapred.job.shuffle.merge.percent
> and dfs file locations and temp directories, and so on, again and again,
> every time we deploy a job, stand up a new cluster or environment, or even
> change company.
> -a fully built-in stack, one global project for SQL, dataframes, MLlib etc,
> so there is no need to bolt on extra projects such as Hive, Hue, HBase,
> etc. This simplifies life and keeps everything in one place.
> -single (global) user-based operation: no need to create hdfs or mapred
> unix users, which makes life much simpler.
> -simple quick-start daemons: master and slaves. Not having to worry about
> JT, NN, DN, TT, RM, HBase master ... and running netstat and jps across
> hundreds of clusters makes life much easier.
> -proper code versioning, feature releases, and release management.
> -good and well-organised documentation with good examples.
>
> In addition to the comments above, here is where we hope Spark never ends
> up:
>
> -tonnes of configuration properties and "go faster" type flags. For example,
> Hadoop and HBase users will know that there is a whole catalogue of
> properties for regions, caches, network settings, block sizes, and so on.
> Please don't end up here, for example:
> https://hadoop.apache.org/docs/r1.0.4/mapred-default.html; it is painful
> having to configure all of this, create a set of properties for each
> environment, and then tie it all into CI and deployment tools.
> -no more daemons and processes that have to be monitored, manipulated, and
> restarted when they crash.
> -a project that penalises the developers (who will ultimately help promote
> Spark to their managers and budget holders) with expensive training,
> certification, books, and accreditation. Ideally this open source should be
> free: free training = more users = more commercial uptake.
>
> Anyway, those are our thoughts, for what they are worth; keep up the good
> work, we just had to mention it. Again, sorry if this is not the right
> place, or if there is another forum for this stuff.
>
> Cheers
>
