We are having the same problem. We're running Spark 0.9.1 in standalone mode, and on some heavy jobs workers become unresponsive and are marked as dead by the master, even though the worker process is still running. They then never rejoin the cluster, and the cluster becomes essentially unusable until we restart each worker.
We'd like to know:
1. Why can a worker become unresponsive? Are there any well-known config / usage pitfalls we could have fallen into? We're still investigating the issue, but maybe there are some hints?
2. Is there an option to auto-recover a worker? E.g. automatically start a new one if the old one has failed, or at least some hooks to implement functionality like that ourselves? (See the rough sketch below for the kind of workaround we have in mind.)
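
For now, the only workaround we can think of is an external watchdog along the following lines. This is just a rough sketch for illustration: SPARK_HOME, the master URL, the worker UI port, and the polling thresholds are placeholders for our setup, and polling the worker's web UI (port 8081 by default) is only a crude liveness check, since in our case the JVM keeps running while the worker is unresponsive.

#!/usr/bin/env python3
# Rough watchdog sketch (not a built-in Spark feature): poll the standalone
# worker's web UI as a crude liveness check and relaunch the worker when it
# stops answering. Paths, URLs, and thresholds below are placeholders.
import subprocess
import time
import urllib.request

SPARK_HOME = "/opt/spark"                  # hypothetical install path
MASTER_URL = "spark://master-host:7077"    # hypothetical master URL
WORKER_UI = "http://localhost:8081"        # standalone worker web UI (default port)

def worker_responsive():
    # Treat any HTTP response from the worker UI as "alive".
    try:
        urllib.request.urlopen(WORKER_UI, timeout=10)
        return True
    except Exception:
        return False

def restart_worker():
    # Kill any lingering worker JVM, then relaunch it via spark-class.
    subprocess.call(["pkill", "-f", "org.apache.spark.deploy.worker.Worker"])
    time.sleep(5)
    subprocess.Popen([SPARK_HOME + "/bin/spark-class",
                      "org.apache.spark.deploy.worker.Worker", MASTER_URL])

if __name__ == "__main__":
    failures = 0
    while True:
        failures = 0 if worker_responsive() else failures + 1
        if failures >= 3:   # three missed checks in a row -> restart the worker
            restart_worker()
            failures = 0
        time.sleep(30)

If there is a built-in or recommended way to achieve this, we'd much prefer that to rolling our own.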