spark-issues mailing list archives

From "Roque Vassal'lo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-5497) start-all script not working properly on Standalone HA cluster (with Zookeeper)
Date Fri, 30 Jan 2015 09:10:34 GMT
Roque Vassal'lo created SPARK-5497:
--------------------------------------

             Summary: start-all script not working properly on Standalone HA cluster (with Zookeeper)
                 Key: SPARK-5497
                 URL: https://issues.apache.org/jira/browse/SPARK-5497
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.2.0
            Reporter: Roque Vassal'lo


I have configured a Standalone HA cluster with Zookeeper with:
- 3 Zookeeper nodes
- 2 Spark master nodes (1 alive and 1 in standby mode)
- 2 Spark slave nodes
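
For reference, the HA-related part of the configuration looks roughly like this (hostnames
zk1-zk3, master1/master2 and slave1/slave2 are placeholders; the recovery properties are the
ones documented for standalone HA with ZooKeeper):

  # conf/spark-env.sh on both masters: enable ZooKeeper-based recovery
  export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

  # conf/slaves: one worker host per line
  slave1
  slave2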

Executing start-all.sh on each master starts that master and also starts a worker on each
configured slave.
If the active master goes down, those workers are supposed to reconfigure themselves to use
the new active master automatically.

I have noticed that the spark-env property SPARK_MASTER_IP is used by both of the scripts it
calls, start-master.sh and start-slaves.sh.
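
Paraphrasing those scripts from memory (the lines below are an approximation of the 1.2.0
versions, not a verbatim copy), both end up building their addresses from that single variable:

  # sbin/start-master.sh (approximate): the master binds to SPARK_MASTER_IP
  "$sbin"/spark-daemon.sh start org.apache.spark.deploy.master.Master 1 \
    --ip $SPARK_MASTER_IP --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT

  # sbin/start-slaves.sh (approximate): every worker registers against SPARK_MASTER_IP
  "$sbin/slaves.sh" cd "$SPARK_HOME" \; \
    "$sbin/start-slave.sh" 1 spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT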

The problem is that if you set SPARK_MASTER_IP to the active master's IP, the workers do not
reassign themselves to the new active master when that master goes down.
And if you set SPARK_MASTER_IP to the master cluster's address list (well, an approximation of
it, because you have to include the master port in all but the last entry, that is
"master1:7077,master2", in order to make it work), the slaves start properly but the master
does not.

So the start-master script needs SPARK_MASTER_IP to contain the master's own IP in order to
start it properly, while the start-slaves script needs SPARK_MASTER_IP to contain the master
cluster's addresses (that is, "master1:7077,master2").
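
In other words, with a single variable there is no value that satisfies both scripts at once;
to illustrate (hostnames are just examples):

  # Lets start-master.sh bind correctly, but pins every worker to master1 only:
  SPARK_MASTER_IP=master1

  # Gives the workers the whole master cluster, but is not an address the master can bind to:
  SPARK_MASTER_IP=master1:7077,master2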

To test that idea, I modified the start-slaves and spark-env scripts on the master nodes, as
sketched after this paragraph.
In spark-env.sh, I set SPARK_MASTER_IP to the master's own IP on each master node (that is,
SPARK_MASTER_IP=master1 on master node 1, and SPARK_MASTER_IP=master2 on master node 2).
In spark-env.sh, I also added a new property, SPARK_MASTER_CLUSTER_IP, with the pseudo
master-cluster addresses (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) on both masters.
In start-slaves.sh, I replaced all references to SPARK_MASTER_IP with SPARK_MASTER_CLUSTER_IP.
I have tried that and it works great! When the active master node goes down, all workers
reassign themselves to the new active node.
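
Putting the quick fix together, the sketch below shows the idea; note that
SPARK_MASTER_CLUSTER_IP is a new variable I introduced for this test, not an existing Spark
property, and the start-slaves.sh line is again an approximation of the 1.2.0 script:

  # conf/spark-env.sh on master1 (master2 is analogous, with SPARK_MASTER_IP=master2)
  SPARK_MASTER_IP=master1
  SPARK_MASTER_CLUSTER_IP=master1:7077,master2

  # sbin/start-slaves.sh: point the workers at the master cluster instead of a single master
  "$sbin/slaves.sh" cd "$SPARK_HOME" \; \
    "$sbin/start-slave.sh" 1 spark://$SPARK_MASTER_CLUSTER_IP:$SPARK_MASTER_PORT

With the port appended by the script, the workers end up registering against
spark://master1:7077,master2:7077, so they can fail over to whichever master becomes active.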

Maybe there is a better fix for this issue.
Hope this quick-fix idea can help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

