Hi everyone!

I'm working on a proof-of-concept using Spark that requires the streaming fixes for Kafka that appeared in 0.7.3, and it must run in Amazon's Elastic MapReduce. This means the instructions at <http://aws.amazon.com/articles/4926593393724923> aren't useful, for at least four reasons:

1) The latest Spark AMI is apparently still based on 0.7.
2) The bootstrap action has not tracked changes on the <http://scala-lang.org> web site, so it tries to download Scala from the wrong place.
3) The bootstrap action uses a fragile regex to identify the master instance.
4) The bootstrap action attempts to copy a file that doesn't exist.
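For concreteness, here are sketches of the kind of fixes 2) and 3) call for — note that the archive URL layout on scala-lang.org and the /mnt/var/lib/info/instance.json metadata path (with its "isMaster" field) are assumptions about the current site and this AMI generation, not something I've confirmed for every release:

    # Assumptions: scala-lang.org keeps historical releases under
    # /files/archive/, and this EMR AMI generation writes per-node metadata
    # to /mnt/var/lib/info/instance.json with an "isMaster" field.

    # 2) Build a stable download URL for a given Scala version.
    scala_url() {
      echo "http://www.scala-lang.org/files/archive/scala-$1.tgz"
    }

    # 3) Detect the master from EMR's instance metadata
    #    instead of a hostname regex.
    is_master() {
      grep -q '"isMaster" *: *true' \
        "${1:-/mnt/var/lib/info/instance.json}" 2>/dev/null
    }

    scala_url 2.9.3                              # prints the 2.9.3 archive URL
    if is_master; then echo master; else echo worker; fi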
Fortunately, someone else created <https://github.com/daithiocrualaoich/spark-emr>, which upgrades Spark to 0.7.2 but still suffers from the other issues. I'm attaching my patch to correct them. I've placed my bootstrap action and Spark tarball on S3, so my Elastic MapReduce CLI invocation to create my test cluster is:

    ./elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "--hdfs-key-value,dfs.permissions=false" \
      --bootstrap-action s3://psnively-stuff/install-spark-0.7.3-bootstrap.sh \
      --bootstrap-name "Mesos/Spark/Shark" \
      --instance-type m1.xlarge --instance-count 3

This results in a waiting cluster. I then follow the directions on the spark-emr GitHub site to scp up the assembly .jar and sample data. Running the first example:
    java -cp /home/hadoop/spark-assembly-1-SNAPSHOT.jar \
      org.boringtechiestuff.spark.TweetWordCount /input /output

results in:

    17:15:10.267 [main] INFO spark.scheduler.DAGScheduler - Failed to run saveAsTextFile at TweetWordCount.scala:22
    Exception in thread "main" spark.SparkException: Job failed: Error: Disconnected from Spark cluster
        at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
        at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
        at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
        at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
        at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)

Googling around, this seems to be a common result, but with widely varied root causes, which suggests room for improvement in exception handling and fault identification. I've confirmed that /input and /output (the latter with a _temporary subdirectory) exist in HDFS, and the various Hadoop-related logs are (so far) unhelpful.

Any advice would be greatly appreciated!

Many thanks and best regards,
Paul
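P.S. For anyone who wants to reproduce my digging through the logs, here's the quick sweep I've been using; it prints the first ERROR or FATAL line from each log file under the directories you name. The default search root is only a guess at where this AMI keeps its logs:

    # Print the first ERROR or FATAL line from each log file under the
    # given directories (the default root is a guess for this AMI).
    scan_logs() {
      [ $# -eq 0 ] && set -- /mnt/var/log
      local d
      for d in "$@"; do
        [ -d "$d" ] || continue
        grep -r -m1 -e ERROR -e FATAL "$d" 2>/dev/null
      done
    }

    scan_logs   # or: scan_logs /some/other/log/dir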