spark-user mailing list archives

From AlexG
Subject what is cause of, and how to recover from, unresponsive nodes w/ spark-ec2 script
Date Wed, 12 Aug 2015 23:28:26 GMT
I'm using the spark-ec2 script to launch a 30-node r3.8xlarge cluster.
Occasionally several nodes become unresponsive. First I notice that HDFS
complains it can't find some blocks; then, when I go to restart Hadoop, the
messages indicate that the connection to some nodes timed out; and when I
check, I can't ssh into those nodes at all.
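
A minimal sketch of the reachability check described above, assuming a
list of node hostnames is at hand and that sshd listens on the default port
22. The function name `unreachable_nodes` and its parameters are hypothetical,
not part of spark-ec2. It only reports which nodes refuse or time out on a
TCP connect; it says nothing about why a node stopped responding.

```python
import socket


def unreachable_nodes(hosts, port=22, timeout=5.0):
    """Return the hosts that do not accept a TCP connection on `port`.

    Assumption: a node counts as unresponsive if the connect attempt is
    refused, times out, or the hostname cannot be resolved.
    """
    down = []
    for host in hosts:
        try:
            # Open and immediately close a plain TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            down.append(host)
    return down
```

Calling `unreachable_nodes(["node-01.example.internal", "node-02.example.internal"])`
with the cluster's actual hostnames (the names here are placeholders) returns
the subset that could not be reached.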

Is this a problem others have experienced? What is causing these random
failures, or where can I look to find relevant logs? And how can I recover
other than by destroying the cluster and starting anew? That is what I've
been doing so far, but it is time-consuming and tedious, and it requires
pulling my large dataset down from S3 to HDFS once again.
