spark-user mailing list archives

From AlexG <swift...@gmail.com>
Subject what is cause of, and how to recover from, unresponsive nodes w/ spark-ec2 script
Date Wed, 12 Aug 2015 23:28:26 GMT
I'm using the spark-ec2 script to launch a 30-node r3.8xlarge cluster.
Occasionally several nodes become unresponsive. First I notice that HDFS
complains it can't find some blocks. When I then try to restart Hadoop, the
messages indicate that connections to some nodes timed out, and when I
check, I can't ssh into those nodes at all.

Is this a problem others have experienced? What causes these random
failures, or where can I look for relevant logs? And how can I recover
without destroying the cluster and starting anew? That is what I've been
doing so far, but it is time-consuming and tedious, and it requires pulling
my large dataset down from S3 to HDFS once again.
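
To narrow down which nodes have dropped off, a minimal sketch along these
lines could be run from the master. It is not part of spark-ec2: the function
name `unreachable_hosts` and the hostnames in the usage note are hypothetical
placeholders. It only checks whether each host still accepts a TCP connection
on the SSH port, so it says nothing about why a node stopped responding.

```python
import socket


def unreachable_hosts(hosts, port=22, timeout=5.0):
    """Return the hosts that do not accept a TCP connection on `port`.

    Hypothetical helper for illustration: a host is reported as unreachable
    if the connection is refused, times out, or its name cannot be resolved.
    """
    down = []
    for host in hosts:
        try:
            # Open and immediately close a TCP connection to the SSH port.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            down.append(host)
    return down
```

For example, `unreachable_hosts(["slave-1.example.com", "slave-2.example.com"])`
(placeholder hostnames) returns the subset that could not be reached, which
can then be compared against the nodes HDFS reports as missing blocks.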






