spark-user mailing list archives

From Piotr Kołaczkowski
Subject Workers disconnected from master sometimes and never reconnect back
Date Thu, 22 May 2014 08:39:11 GMT

Another problem we observed is that on a very heavily loaded cluster, if a
worker fails to respond to the heartbeat within 60 seconds, it gets
disconnected permanently from the master and never reconnects. It is very
easy to reproduce: just set up a Spark standalone cluster on a single
machine, suspend it for a while, and after waking up the cluster no longer
works because all workers are lost.
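
To make the failure mode concrete, here is a toy model of what we think is
happening. It is only a sketch, not Spark's actual code: the names Master,
heartbeat, check_timeouts and simulate are hypothetical, and the idea that
the master ignores heartbeats from a worker it has already dropped is an
assumption based on the behaviour we see.

```python
# Toy model of the observed behaviour; NOT Spark's implementation.
# All class and function names here are hypothetical.

WORKER_TIMEOUT_S = 60  # the 60-second heartbeat window mentioned above


class Master:
    def __init__(self):
        self.last_seen = {}  # worker id -> time of last accepted heartbeat
        self.lost = set()    # workers dropped after missing the window

    def register(self, worker_id, now):
        # (Re-)registration makes the worker known to the master again.
        self.lost.discard(worker_id)
        self.last_seen[worker_id] = now

    def heartbeat(self, worker_id, now):
        # Assumption: heartbeats from unknown workers are ignored, so a
        # dropped worker that only keeps heartbeating never comes back.
        if worker_id in self.last_seen:
            self.last_seen[worker_id] = now
            return True
        return False

    def check_timeouts(self, now):
        # Drop every worker whose last heartbeat is older than the window.
        for worker_id, seen in list(self.last_seen.items()):
            if now - seen > WORKER_TIMEOUT_S:
                del self.last_seen[worker_id]
                self.lost.add(worker_id)


def simulate(suspend_s, reregister):
    """Return True if the worker is known to the master after the pause."""
    master = Master()
    master.register("worker-1", now=0)
    now = suspend_s                 # the machine wakes up suspend_s later
    master.check_timeouts(now)      # master notices the missed heartbeats
    accepted = master.heartbeat("worker-1", now)
    if not accepted and reregister:
        master.register("worker-1", now)  # hypothetical recovery step
    return "worker-1" in master.last_seen
```

In this model, simulate(120, reregister=False) leaves the worker lost for
good, which matches what we observe, while simulate(120, reregister=True)
brings it back. A pause shorter than the window, e.g. simulate(30,
reregister=False), does no harm.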

Is there any way to mitigate this?


Piotr Kolaczkowski, Lead Software Engineer
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
