1. EMR 4.2.0 has Zeppelin as an alternative to Databricks notebooks.

2. EMR has Ganglia 3.6.0.

3. EMR has Hadoop filesystem settings that make S3 work fast (direct.EmrFileSystem).

4. EMR has the S3 keys in the Hadoop configs.

5. EMR allows you to resize the cluster on the fly.

6. EMR has the AWS SDK on the Spark classpath, which helps reduce the size of the app assembly jar.

7. The ec2 script installs everything in /root, while EMR has dedicated users: hadoop, zeppelin, etc. In that respect EMR is similar to Cloudera or Hortonworks.

8. There are at least 3 spark-ec2 projects (in apache/spark, in mesos, in amplab). The master branch in Spark has an outdated ec2 script. The other projects have broken links in their READMEs. WHAT A MESS!

9. The ec2 script has bad documentation and uninformative error messages. E.g. the README does not say anything about the --private-ips option. If you do not add the flag, the script connects to an empty-string host (localhost) instead of the master. This was fixed only last week, and I am not sure it is fixed in all branches.

10. I think Amazon will add spark-jobserver to EMR soon.

11. You do not need to be an AWS expert to start an EMR cluster. Users can use the EMR web UI to start a cluster to run some jobs or work in Zeppelin during the day.

12. An EMR cluster starts in about 8 min. The ec2 script takes longer, and you need to stay online while it runs.
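
To illustrate the on-the-fly resize mentioned in the list above: it can also be scripted instead of done through the web UI. This is only a minimal sketch, assuming boto3 is installed and AWS credentials are already configured. The helper names (build_resize_request, resize), the default region, and the instance group ID in the usage note are hypothetical placeholders, not anything taken from this thread.

```python
def build_resize_request(instance_group_id, new_count):
    """Build the keyword arguments for boto3's EMR modify_instance_groups call.

    instance_group_id is a placeholder for a real EMR instance group ID.
    """
    if new_count < 0:
        raise ValueError("instance count must be non-negative")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }


def resize(instance_group_id, new_count, region="us-east-1"):
    """Ask EMR to change the size of one instance group (assumed region)."""
    import boto3  # imported lazily so the helper above works without boto3

    emr = boto3.client("emr", region_name=region)
    emr.modify_instance_groups(**build_resize_request(instance_group_id, new_count))
```

Calling resize("ig-EXAMPLE", 5) would request five instances in that group. The real group ID has to be looked up first, for example in the EMR console.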

On Dec 1, 2015 9:22 AM, "Jerry Lam" <chilinglam@gmail.com> wrote:
Simply put:

EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR API + Selected Instance Types + Amazon EC2 Friendly (bootstrapping)
spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any Instance Type

I use spark-ec2 for prototyping and I have never used it for production.

just my $0.02

On Dec 1, 2015, at 11:15 AM, Nick Chammas <nicholas.chammas@gmail.com> wrote:

Pinging this thread in case anyone has thoughts on the matter they want to share.

On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <[hidden email]> wrote:
Spark has come bundled with spark-ec2 for many years. At the same time, EMR has been capable of running Spark for a while, and earlier this year it added "official" support.

If you're looking for a way to provision Spark clusters, there are some clear differences between these 2 options. I think the biggest one would be that EMR is a "production" solution backed by a company, whereas spark-ec2 is not really intended for production use (as far as I know).

That particular difference in intended use may or may not matter to you, but I'm curious:

What are some of the other differences between the 2 that do matter to you? If you were considering these 2 solutions for your use case at one point recently, why did you choose one over the other?

I'd be especially interested in hearing about why people might choose spark-ec2 over EMR, since the latter option seems to have shaped up nicely this year.


View this message in context: Re: spark-ec2 vs. EMR
Sent from the Apache Spark User List mailing list archive at Nabble.com.