spark-user mailing list archives

From Ognen Duzlevski <og...@plainvanillagames.com>
Subject Re: Missing Spark URL after starting the master
Date Mon, 03 Mar 2014 21:06:53 GMT
I should add that in this setup you really do not need to look for the 
printout of the master node's IP - you set it yourself a priori. If 
anyone is interested, let me know and I can write it all up so that 
people can follow a set of instructions. Who knows, maybe I can come up 
with a set of scripts to automate it all...
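
For example, a minimal conf/spark-env.sh could look like the following - 
the IP and port are just the ones from my setup, and I simply replicate 
the same file everywhere when I tar up the spark/ folder:

    # conf/spark-env.sh
    export SPARK_MASTER_IP=10.10.0.200   # the master's private IP inside the VPC
    export SPARK_MASTER_PORT=7077        # the default standalone port

Workers and drivers then connect to spark://10.10.0.200:7077.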

Ognen


On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
> I have a standalone Spark cluster running in an Amazon VPC that I set 
> up by hand. All I did was provision the machines from a common AMI 
> image (my underlying distribution is Ubuntu), create a "sparkuser" 
> on each machine, and set up a /home/sparkuser/spark folder where I 
> downloaded Spark. I did this on the master only: I ran sbt/sbt assembly 
> and set up conf/spark-env.sh to point to the master, which is an 
> IP address (in my case 10.10.0.200; the port is the default 7077). I 
> also set up the slaves file in the same subdirectory with all 16 IP 
> addresses of the worker nodes (in my case 10.10.0.201-216). After 
> sbt/sbt assembly was done on the master, I did cd ~/; tar -czf 
> spark.tgz spark/, copied the resulting tgz file to each worker 
> using the same "sparkuser" account, and unpacked the .tgz on each slave. 
> This effectively replicates everything from the master to all slaves; 
> you can script this so you don't do it by hand.
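>
> A rough sketch of that copy step (assuming password-less ssh for 
> "sparkuser" is already set up; the IP range is just my setup):
>
>   # on the master, as sparkuser, after sbt/sbt assembly
>   cd ~ && tar -czf spark.tgz spark/
>   for i in $(seq 201 216); do
>     scp spark.tgz sparkuser@10.10.0.$i:~/            # push the build to each worker
>     ssh sparkuser@10.10.0.$i 'tar -xzf ~/spark.tgz'  # unpack into /home/sparkuser/spark
>   done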
>
> Your AMI should have the distribution's version of Java and git 
> installed by the way.
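>
> On Ubuntu that is just something like this (the package names are an 
> assumption and may differ depending on the release):
>
>   sudo apt-get update
>   sudo apt-get install -y openjdk-7-jdk git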
>
> All you have to do then is run spark/sbin/start-all.sh on the master 
> as "sparkuser" (that path is for 0.9; in 0.8.1 it is 
> spark/bin/start-all.sh) and it will all automagically start :)
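>
> A quick sanity check after that (the IP and port are the defaults from 
> my setup):
>
>   sparkuser@spark-master> jps   # should show a Master process
>   sparkuser@some-worker> jps    # should show a Worker process
>   # the master web UI at http://10.10.0.200:8080 should list all 16 workers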
>
> All my Amazon nodes come with 4x400 GB of ephemeral space, which I 
> have set up as a 1.6TB RAID0 array on each node and pooled into an 
> HDFS filesystem. The namenode runs outside the Spark cluster, while 
> the datanodes are the same nodes as the Spark workers. This gives you 
> replication and extremely fast access, since ephemeral storage is much 
> faster than EBS or anything else on Amazon (you can do even better 
> with SSD drives in this setup, but it will cost ya).
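>
> The RAID0 part is roughly this (the device names are an assumption and 
> depend on the instance type, so check with lsblk first):
>
>   # stripe the four ephemeral disks into one array
>   sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
>       /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>   sudo mkfs.ext4 /dev/md0
>   sudo mkdir -p /mnt/hdfs && sudo mount /dev/md0 /mnt/hdfs
>   # then point dfs.datanode.data.dir in hdfs-site.xml at /mnt/hdfs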
>
> If anyone is interested I can document our pipeline setup - I came up 
> with it myself and do not have a clue what the industry standards are, 
> since I could not find any written instructions anywhere online about 
> how to set up a whole data analytics pipeline from the point of 
> ingestion to the point of analytics (people don't want to share their 
> secrets? or am I just in the dark and incapable of using Google 
> properly?). My requirement was that I wanted this to run within a VPC 
> for added security and simplicity; the Amazon security groups get 
> really old quickly. An added bonus is that you can use a VPN as an 
> entry point into the whole system, and your cluster instantly becomes 
> "local" to you in terms of IPs etc. I use OpenVPN since I don't like 
> Cisco or Juniper (the only two options Amazon provides for their VPN 
> gateways).
>
> Ognen
>
>
> On 3/3/14, 1:00 PM, Bin Wang wrote:
>> Hi there,
>>
>> I have a CDH cluster set up, and I tried using the Spark parcel that 
>> comes with Cloudera Manager, but it turned out it doesn't even have 
>> the run-example shell command in the bin folder. Then I removed it 
>> from the cluster, cloned incubator-spark onto the name node of my 
>> cluster, and built from source there successfully with everything at 
>> the defaults.
>>
>> I ran a few examples and everything seems to work fine in local 
>> mode. Now I am thinking about scaling it out to my cluster, which is 
>> what the "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I 
>> want to add all the datanodes as slaves, and I think I should run 
>> Spark in standalone mode.
>>
>> Say I am trying to set up Spark in standalone mode following these 
>> instructions:
>> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
>> However, it says "Once started, the master will print out a 
>> |spark://HOST:PORT| URL for itself, which you can use to connect 
>> workers to it, or pass as the "master" argument to |SparkContext|. 
>> You can also find this URL on the master's web UI, which is 
>> http://localhost:8080 by default."
>>
>> After I start the master, no URL is printed on the screen, and the 
>> web UI is not running either.
>> Here is the output:
>> [root@box incubator-spark]# ./sbin/start-master.sh
>> starting org.apache.spark.deploy.master.Master, logging to 
>> /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
>>
>> First question: am I even in the ballpark running Spark in standalone 
>> mode if I want to fully utilize my cluster? I saw there are four ways 
>> to launch Spark on a cluster - Amazon EC2, standalone mode, Apache 
>> Mesos, Hadoop YARN... I guess standalone mode is the way to go?
>>
>> Second question: how do I get the Spark URL of the cluster, and why 
>> is the output not like what the instructions say?
>>
>> Best regards,
>>
>> Bin
>
> -- 
> Some people, when confronted with a problem, think "I know, I'll use regular expressions."
> Now they have two problems.
> -- Jamie Zawinski

-- 
Some people, when confronted with a problem, think "I know, I'll use regular expressions."
Now they have two problems.
-- Jamie Zawinski

