spark-user mailing list archives

From: Paul Snively <psniv...@icloud.com>
Subject: Spark 0.7.3 on Amazon Elastic MapReduce
Date: Tue, 06 Aug 2013 17:41:19 GMT
Hi everyone!

I'm working on a proof-of-concept using Spark that requires the streaming fixes for Kafka
that appeared in 0.7.3 and must run in Amazon's Elastic MapReduce.

This means the instructions at <http://aws.amazon.com/articles/4926593393724923> aren't
useful for at least four reasons:

1) The latest Spark AMI is apparently still based on 0.7.
2) The bootstrap action has not tracked changes in the <http://scala-lang.org> web site,
so it tries to download Scala from the wrong place.
3) The bootstrap action uses a fragile regex to identify the master instance (see the sketch
after this list for a more robust check).
4) The bootstrap action attempts to copy a file that doesn't exist.
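
In rough outline, the fixes for 2) and 3) look like the sketch below. This is a sketch only;
see the attached patch for the details, and verify the Scala URL and version against your
own setup:

# Master detection: EMR writes per-instance metadata to
# /mnt/var/lib/info/instance.json, which includes an isMaster flag,
# so there is no need to grep hostnames.
if grep -q 'isMaster.*true' /mnt/var/lib/info/instance.json; then
  IS_MASTER=true
else
  IS_MASTER=false
fi

# Scala download: scala-lang.org reorganized its downloads, but old
# releases are kept under /files/archive/. Spark 0.7.x builds against
# Scala 2.9.3.
SCALA_VERSION=2.9.3
wget "http://www.scala-lang.org/files/archive/scala-${SCALA_VERSION}.tgz"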

Fortunately, someone else created <https://github.com/daithiocrualaoich/spark-emr>,
which upgrades Spark to 0.7.2, but still suffers from the other issues. I'm attaching my patch
to correct them. I've placed my bootstrap action and Spark tarball on S3, so my Elastic MapReduce
CLI invocation to create my test cluster is:

./elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--hdfs-key-value,dfs.permissions=false" \
  --bootstrap-action s3://psnively-stuff/install-spark-0.7.3-bootstrap.sh \
  --bootstrap-name "Mesos/Spark/Shark" \
  --instance-type m1.xlarge --instance-count 3

This results in a cluster in the WAITING state.
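
(The cluster state can be confirmed from the same CLI; the jobflow ID below is a placeholder:)

./elastic-mapreduce --list --active
./elastic-mapreduce --describe --jobflow j-XXXXXXXXXXXXX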

I then follow the directions on the spark-emr GitHub site to scp up the assembly .jar and
sample data. Running the first example:
java -cp /home/hadoop/spark-assembly-1-SNAPSHOT.jar \
  org.boringtechiestuff.spark.TweetWordCount /input /output
results in:

17:15:10.267 [main] INFO  spark.scheduler.DAGScheduler - Failed to run saveAsTextFile at TweetWordCount.scala:22
Exception in thread "main" spark.SparkException: Job failed: Error: Disconnected from Spark cluster
	at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
	at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
	at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
	at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
	at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)

From some Googling, this seems to be a common failure with widely varied root causes, which
suggests there may be room for improvement in exception handling and fault identification.
I've confirmed that /input and /output (the latter with a _temporary subdirectory) exist in
HDFS, and the various Hadoop-related logs have been unhelpful so far.
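
(By "confirmed" I mean nothing fancier than the usual listings, roughly:)

hadoop fs -ls /
hadoop fs -ls /input
hadoop fs -ls /output
hadoop fs -ls /output/_temporary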

Any advice would be greatly appreciated!

Many thanks and best regards,
Paul

