spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gimantha Bandara <>
Subject Possible "split brain" situation
Date Mon, 13 Nov 2017 07:09:20 GMT
Hi all,

We are using embedded Spark 1.6.2 in our analytics platform[1]. For the
cluster communication we use hazel-cast clustering capabilities. From
Hazelcast side we set the following configurations, in order to configure
the hearbeat properties.

So, if there is a network failure for more than 30 seconds,  Hazelcast
detects that this is a network outage. So each node sees that the other
node left the cluster (in a 2 node cluster). What happens is, the current
active node stays active and the passive master also becomes active because
of the network outage. Following are the respective messages printed by

*node 1:*
TID: [-1] [] [2017-11-05 08:59:14,071]  INFO
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -
Member left [ff9788ad-8570-465c-846a-82393c636ae5]: /

*node 2:*
TID: [-1] [] [2017-11-05 09:00:40,357]  INFO
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -
Member left [d02e49f5-78ab-4bfb-ba2c-df24f7e5c058]: /

When the network recovers, we see the following error in Spark,

ERROR {org.apache.spark.deploy.worker.Worker} -  Worker registration
failed: Duplicate worker ID {org.apache.spark.deploy.worker.Worker}

This also cause our server to shutdown(please note that we spawn a spark
JVM from our server). Please see the following shutdown error.

INFO {org.wso2.carbon.core.init.CarbonServerManager} -  Shutdown hook
triggered.... {org.wso2.carbon.core.init.CarbonServerManager}

Do you guys have any idea what is happening to a Spark cluster when there
is a network outage and how it handles the split brain situations? Is there
a way to restore the previous cluster state, once both nodes join the
cluster again? Also I am curious to know why the shudown hook is triggered.

Thanks in advance!

View raw message