Paul, could you remove the formatting of your normal text in future? It's quite a large font.

With the GitHub model, you should probably fork that repo and open a PR with any changes, as the author may well not read here often.

New troubleshooting guides/discussions to update the FAQ are in the works, but in general you need to look at the logs on the workers or the master to see more detail.

Any instructions on that GitHub repo wouldn't really be supported here, so you might want to raise the problem on its issue tracker. It's the Mesos logs you need to find, not the Hadoop ones, since this isn't a failure of an MR job but of a Mesos one. The AWS EMR image sets up a Mesos cluster on the nodes.

On your new cluster, does a word count run from a normal text file, before any streaming is involved? Can you read/write to S3? It could be that the HDFS interaction is misconfigured, or it could be the code, or any of a whole host of possibilities.
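If it helps, here's the shape of that smoke test as a minimal sketch, using plain Scala collections to mirror the Spark RDD chain (flatMap -> map -> reduceByKey). On the cluster you'd apply the same chain to sc.textFile(...) and finish with saveAsTextFile(...); the HDFS/S3 paths mentioned in the comments are placeholders, not paths from your setup.

```scala
// Local sketch of a word count. On the cluster, the same chain would run
// against sc.textFile("hdfs:///input") (or an s3n://bucket/... path, to
// check S3 access) and end with .saveAsTextFile("hdfs:///output").
val lines = Seq("to be or not to be", "that is the question")

val counts: Map[String, Int] = lines
  .flatMap(_.split("\\s+"))           // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .groupBy(_._1)                      // local stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

counts.toSeq.sortBy(-_._2).foreach(println)
```

If this runs against a plain text file in HDFS but fails against s3n://, that points at the S3 configuration rather than your code.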

On Tue, Aug 6, 2013 at 10:41 AM, Paul Snively <> wrote:
Hi everyone!

I'm working on a proof-of-concept using Spark that requires the streaming fixes for Kafka that appeared in 0.7.3 and must run in Amazon's Elastic MapReduce.

This means the instructions at <> aren't useful for at least four reasons:

1) The latest Spark AMI is apparently still based on 0.7.
2) The bootstrap action has not tracked changes in the <> web site, so it tries to download Scala from the wrong place.
3) The bootstrap action uses a fragile regex to identify the master instance.
4) The bootstrap action attempts to copy a file that doesn't exist.

Fortunately, someone else created <>, which upgrades Spark to 0.7.2, but still suffers from the other issues. I'm attaching my patch to correct them. I've placed my bootstrap action and Spark tarball on S3, so my Elastic MapReduce CLI invocation to create my test cluster is:

./elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--hdfs-key-value,dfs.permissions=false" \
  --bootstrap-action s3://psnively-stuff/ \
  --bootstrap-name "Mesos/Spark/Shark" \
  --instance-type m1.xlarge --instance-count 3

This results in a waiting cluster.

I then follow the directions on the spark-emr GitHub site to scp up the assembly .jar and sample data. Running the first example:

java -cp /home/hadoop/spark-assembly-1-SNAPSHOT.jar \
  org.boringtechiestuff.spark.TweetWordCount /input /output

results in:

17:15:10.267 [main] INFO  spark.scheduler.DAGScheduler - Failed to run saveAsTextFile at TweetWordCount.scala:22
Exception in thread "main" spark.SparkException: Job failed: Error: Disconnected from Spark cluster
at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
at spark.scheduler.DAGScheduler$$anon$

Googling around, it seems this is a common result, but with widely varied root causes, suggesting possible room for improvement in exception-handling/fault identification. I've confirmed that /input and /output, with a _temporary subdirectory, exist in HDFS, and the various Hadoop-related logs are (so far) unhelpful.

Any advice would be greatly appreciated!

Many thanks and best regards,