spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gimantha Bandara <giman...@wso2.com>
Subject Possible "split brain" situation
Date Mon, 13 Nov 2017 07:09:20 GMT
Hi all,

We are using embedded Spark 1.6.2 in our analytics platform[1]. For the
cluster communication we use hazel-cast clustering capabilities. From
Hazelcast side we set the following configurations, in order to configure
the hearbeat properties.

hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=45

So, if there is a network failure for more than 30 seconds,  Hazelcast
detects that this is a network outage. So each node sees that the other
node left the cluster (in a 2 node cluster). What happens is, the current
active node stays active and the passive master also becomes active because
of the network outage. Following are the respective messages printed by
hazelcast.

*node 1:*
TID: [-1] [] [2017-11-05 08:59:14,071]  INFO
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -
Member left [ff9788ad-8570-465c-846a-82393c636ae5]: /10.36.239.70:4000
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme}

*node 2:*
TID: [-1] [] [2017-11-05 09:00:40,357]  INFO
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -
Member left [d02e49f5-78ab-4bfb-ba2c-df24f7e5c058]: /10.36.239.67:4000
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme}


When the network recovers, we see the following error in Spark,

ERROR {org.apache.spark.deploy.worker.Worker} -  Worker registration
failed: Duplicate worker ID {org.apache.spark.deploy.worker.Worker}


This also cause our server to shutdown(please note that we spawn a spark
JVM from our server). Please see the following shutdown error.

INFO {org.wso2.carbon.core.init.CarbonServerManager} -  Shutdown hook
triggered.... {org.wso2.carbon.core.init.CarbonServerManager}

Do you guys have any idea what is happening to a Spark cluster when there
is a network outage and how it handles the split brain situations? Is there
a way to restore the previous cluster state, once both nodes join the
cluster again? Also I am curious to know why the shudown hook is triggered.

Thanks in advance!
Gimantha

Mime
View raw message