spark-user mailing list archives

From Mich Talebzadeh <>
Subject Re: Spark configuration with 5 nodes
Date Fri, 11 Mar 2016 16:25:19 GMT
Hi Steve,

My argument has always been that if one is going to use Solid State Disks
(SSDs), it makes sense to use them for the NN disks (start-up from the
fsimage etc.). Obviously the NN lives in memory. Would you also recommend
RAID10 (mirroring & striping) for the NN disks?


Dr Mich Talebzadeh


On 11 March 2016 at 10:42, Steve Loughran <> wrote:

> On 10 Mar 2016, at 22:15, Ashok Kumar <> wrote:
> Hi,
> We intend to use 5 servers which will be utilized for building a Big Data
> Hadoop data warehouse system (not using any proprietary distribution like
> Hortonworks or Cloudera or others).
>
> I'd argue that life is simpler with either of these, or bigtop+ambari
> built up yourself, for the management and monitoring tools more than
> anything else. Life is simpler if there's a web page of cluster status.
> But: DIY teaches you the internals of how things work, which is good for
> getting your hands dirty later on. Just start to automate things from the
> outset, keep configs under SCM, etc. And decide whether or not you want to
> go with Kerberos (==secure HDFS) from the outset. If you don't, put your
> cluster on a separate isolated subnet. You ought to have the boxes on a
> separate switch anyway if you can, just to avoid network traffic hurting
> anyone else on the switch.
>
> All servers have the same configuration: 512GB RAM, 30TB storage and 16
> cores, running Ubuntu Linux. Hadoop will be installed on all the
> servers/nodes. Server 1 will be used for NameNode plus DataNode as well.
> Server 2 will be used for standby NameNode & DataNode. The rest of the
> servers will be used as DataNodes.
>
> 1. Make sure you've got the HDFS/NN space allocation on the two NNs set up
> so that HDFS blocks don't get in the way of the NN's needs (which ideally
> should be on a separate disk with RAID turned on).
> 2. Same for the worker nodes; temp space matters.
> 3. On a small cluster, the cost of a DN failure is more significant: a
> bigger fraction of the data will go offline, recovery bandwidth is limited
> to the 4 remaining boxes, etc, etc. Just be aware of that: in a bigger
> cluster, a single server failure is usually less traumatic. Though
> HDFS-599 shows that even Facebook can lose a cluster or two.
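
To put rough numbers on point 3, here is a minimal back-of-the-envelope
sketch in Python. It assumes replicas are spread evenly across DataNodes,
and the 20 TB "used per node" figure is purely hypothetical (the thread only
gives 30TB of raw storage per server). The function name is made up for
illustration.

```python
def dn_failure_impact(num_datanodes, used_tb_per_node):
    """Rough impact of losing one DataNode, assuming evenly spread replicas."""
    if num_datanodes < 2:
        raise ValueError("need at least two DataNodes")
    survivors = num_datanodes - 1
    return {
        # Share of all block replicas that lived on the failed node.
        "fraction_of_replicas_lost": 1 / num_datanodes,
        # Nodes left to absorb the re-replication traffic.
        "survivors": survivors,
        # Data each survivor must take on to restore the replica count.
        "tb_to_rereplicate_per_survivor": used_tb_per_node / survivors,
    }


# Hypothetical: 20 TB used on each node.
small = dn_failure_impact(5, 20.0)    # the 5-node cluster from this thread
large = dn_failure_impact(100, 20.0)  # a larger cluster, for comparison
print(small)
print(large)
```

With 5 nodes, one failure takes out a fifth of all replicas and each of the
4 survivors has to absorb about 5 TB; with 100 nodes the same failure is 1%
of replicas spread over 99 machines.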
>
> Now we would like to install Spark on each server to create a Spark
> cluster. Is that a good thing to do, or should we buy additional hardware
> for Spark (minding cost here), or do we simply require additional memory
> to accommodate Spark as well? In that case, how much memory would you
> recommend for each Spark node?
>
> You should be running your compute work on the same systems as the data,
> as it's the "hadoop cluster way"; locality of data ==> performance. If you
> were to buy more hardware, go for more store+compute, rather than just
> compute.
> Spark likes RAM for sharing results; less RAM == more problems. But you
> can buy extra RAM if you need it, provided you've got space in the servers
> to put it in. Same for storage.
> Do make sure that you have ECC memory; there are some papers from Google
> and Microsoft on that topic if you want links to the details. Without ECC
> your data will be corrupted *and you won't even know*.
> -Steve
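
On the "how much memory for each Spark node" question, one way to reason
about it is to carve up a node's RAM and cores after setting some aside for
the OS and the HDFS daemons. The sketch below is only an illustration: the
512GB / 16 cores come from the thread, but the reserved amounts and the
cores-per-executor value are assumptions, not recommendations from Steve,
and the function name is hypothetical. `spark.executor.cores` and
`spark.executor.memory` are the standard Spark setting names.

```python
def executor_plan(node_ram_gb, node_cores, reserved_ram_gb,
                  reserved_cores, cores_per_executor):
    """Split one node's spare RAM and cores evenly between Spark executors."""
    usable_cores = node_cores - reserved_cores
    usable_ram_gb = node_ram_gb - reserved_ram_gb
    if usable_cores < cores_per_executor or usable_ram_gb <= 0:
        raise ValueError("not enough spare resources for one executor")
    executors = usable_cores // cores_per_executor
    # Integer GB per executor; any remainder is left unallocated.
    memory_gb = usable_ram_gb // executors
    return {
        "executors_per_node": executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{memory_gb}g",
    }


# Assumed reservations: 64 GB and 1 core for the OS, DataNode and
# (on two of the boxes) the NameNode.
plan = executor_plan(512, 16, reserved_ram_gb=64, reserved_cores=1,
                     cores_per_executor=5)
print(plan)
```

In practice you would likely leave further headroom for off-heap overhead
and the OS page cache, and very large executor heaps can bring long GC
pauses, so treat the output as a starting point to tune rather than a
setting to copy.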
