spark-user mailing list archives

From Ognen Duzlevski
Subject Re: Missing Spark URL after staring the master
Date Mon, 03 Mar 2014 21:02:55 GMT
I have a standalone Spark cluster running in an Amazon VPC that I set up 
by hand. All I did was provision the machines from a common AMI 
(my underlying distribution is Ubuntu). I created a "sparkuser" on each 
machine and I have a /home/sparkuser/spark folder where I downloaded 
Spark. I did this on the master only: I ran sbt/sbt assembly and I set 
up the conf/ to point to the master which is an IP address 
(in my case, the port is the default 7077). I also set up 
the slaves file in the same subdirectory to have all 16 ip addresses of 
the worker nodes (in my case After sbt/sbt assembly 
was done on master, I then did cd ~/; tar -czf spark.tgz spark/ and I 
copied the resulting tgz file to each worker using the same "sparkuser" 
account and unpacked the .tgz on each slave (this will effectively 
replicate everything from master to all slaves - you can script this so 
you don't do it by hand).
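
The copy-and-unpack step can be scripted roughly as below. This is only a sketch under a few assumptions that are not in my setup notes above: passwordless SSH works for the same account on every worker, the slaves file has one host or IP per line, and the function name distribute_spark and the DRY_RUN switch are made up for illustration.

```shell
# Sketch: copy a Spark tarball to every worker listed in a slaves file
# and unpack it in the remote home directory.
# Assumptions: passwordless SSH to each worker, one host per line.
# distribute_spark and DRY_RUN are hypothetical names for this example.
distribute_spark() {
  local slaves_file="$1"   # e.g. ~/spark/conf/slaves
  local tarball="$2"       # e.g. ~/spark.tgz
  local run=""
  local host

  # With DRY_RUN=1 the commands are printed instead of executed.
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

  # Read hosts on fd 3 so ssh cannot swallow the rest of the list.
  while IFS= read -r host <&3 || [ -n "$host" ]; do
    # Skip blank lines and comment lines.
    case "$host" in ''|'#'*) continue ;; esac
    $run scp "$tarball" "$host:" || return 1
    $run ssh "$host" "tar -xzf $(basename "$tarball")" || return 1
  done 3< "$slaves_file"
}
```

Running it first as DRY_RUN=1 distribute_spark ~/spark/conf/slaves ~/spark.tgz prints the scp and ssh commands without touching the workers, which is a cheap way to check the host list before copying anything.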

Your AMI should have the distribution's version of Java and git 
installed by the way.

All you have to do then is sparkuser@spark-master> 
spark/sbin/ (for 0.9, in 0.8.1 it is spark/bin/) 
and it will all automagically start :)
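
If the spark://HOST:PORT URL does not show up on the console after starting the master (as in the quoted message below), one place to look is the log file that the start script names after "logging to". The helper below is a sketch, assuming the master writes its URL somewhere in that log; find_master_url is a hypothetical name, not something shipped with Spark.

```shell
# Sketch: print the first spark://HOST:PORT URL found in a master log.
# Assumption: the master writes its URL somewhere in the log file
# passed as the first argument. find_master_url is a made-up helper.
find_master_url() {
  grep -o 'spark://[^[:space:]]*' "$1" | head -n 1
}
```

Call it with the path the start script reported, for example find_master_url followed by the .out file under logs/. If nothing is printed, the master most likely did not come up and the same log should say why.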

All my Amazon nodes come with 4x400 GB of ephemeral space which I have 
set up into a 1.6 TB RAID0 array on each node and I am pooling this into 
an HDFS filesystem which is operated by a namenode outside the spark 
cluster while all the datanodes are the same nodes as the spark workers. 
This enables replication and extremely fast access since ephemeral is 
much faster than EBS or anything else on Amazon (you can do even better 
with SSD drives on this setup but it will cost ya).

If anyone is interested I can document our pipeline setup - I came up 
with it myself and do not have a clue as to what the industry standards 
are since I could not find any written instructions anywhere online 
about how to set up a whole data analytics pipeline from the point of 
ingestion to the point of analytics (people don't want to share their 
secrets? or am I just in the dark and incapable of using Google 
properly?). My requirement was that I wanted this to run within a VPC 
for added security and simplicity; the Amazon security groups get really 
old quickly. An added bonus is that you can use a VPN as an entry into the 
whole system and your cluster instantly becomes "local" to you in terms 
of IPs etc. I use OpenVPN since I like neither Cisco nor Juniper (the only 
two options Amazon provides for their VPN gateways).


On 3/3/14, 1:00 PM, Bin Wang wrote:
> Hi there,
> I have a CDH cluster set up, and I tried using the Spark parcel that comes 
> with Cloudera Manager, but it turned out they don't even have the 
> run-example shell command in the bin folder. Then I removed it from 
> the cluster and cloned the incubator-spark into the name node of my 
> cluster, and built from source there successfully with everything as 
> default.
> I ran a few examples and everything seems to work fine in local mode. 
> Then I am thinking about scaling it to my cluster, which is what the 
> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to 
> add all the datanodes to the slaves and think I should run Spark in 
> the standalone mode.
> Say I am trying to set up Spark in the standalone mode following this 
> instruction:
> However, it says "Once started, the master will print out a 
> |spark://HOST:PORT| URL for itself, which you can use to connect 
> workers to it, or pass as the "master" argument to |SparkContext|. You 
> can also find this URL on the master's web UI, which is 
> http://localhost:8080 <http://localhost:8080/> by default."
> After I started the master, there is no URL printed on the screen and 
> the web UI is not running either.
> Here is the output:
> [root@box incubator-spark]# ./sbin/
> starting org.apache.spark.deploy.master.Master, logging to 
> /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
> First Question: am I even in the ballpark to run Spark in standalone 
> mode if I try to fully utilize my cluster? I saw there are four ways 
> to launch Spark on a cluster, AWS-EC2, Spark in standalone, Apache 
> Mesos, Hadoop YARN... so I guess standalone mode is the way to go?
> Second Question: how to get the Spark URL of the cluster, why the 
> output is not like what the instruction says?
> Best regards,
> Bin

Some people, when confronted with a problem, think "I know, I'll use regular expressions."
Now they have two problems.
-- Jamie Zawinski
