hbase-user mailing list archives

From Ian Roughley <rough...@gmail.com>
Subject Re: Hardware configuration
Date Mon, 02 May 2011 19:06:27 GMT
Sorry - I meant to answer Iulia, not Michael.  I was speaking more generally, as there is
also no guarantee that MR jobs are running.  So perhaps I should add in deployment /
running-server considerations.


On 05/02/2011 01:47 PM, Jean-Daniel Cryans wrote:
> Ian,
> Regarding your first point, I understand where the concern is coming
> from, but I'd like to point out that with the new MemStore-Local
> Allocation Buffers, full GCs taking minutes might not be as much of an
> issue as they used to be. That said, I haven't tested that out yet,
> and I don't know of anyone who has.
> Your second point is dead-on. Not only does re-replication take time,
> it can also steal precious IO, and in 0.20 it's pretty much impossible
> to limit the rate of re-replication.
> J-D
> On Mon, May 2, 2011 at 7:30 AM, Ian Roughley <roughley@gmail.com> wrote:
>> I think that there are two important considerations:
>> 1. Can the JVM you're planning on using support a heap of > 10GB? If not, you're
>> wasting money.
>> 2. Putting more disk on nodes means that a failure will take longer to re-replicate
>> back to its balanced state. I.e., given your network topology, how long will even a
>> 50TB machine take: a day, a week, longer?
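Ian's second point can be put in rough numbers. A back-of-the-envelope sketch (the 1 Gbps sustained re-replication rate and the per-node disk figures are assumptions, not numbers from the thread):

```python
# Rough re-replication time estimate for a dead data node (all figures assumed).
def rereplication_hours(data_tb, sustained_gbps):
    """Hours to re-create data_tb of lost replicas at sustained_gbps."""
    bits = data_tb * 1e12 * 8               # TB -> bits (decimal TB)
    seconds = bits / (sustained_gbps * 1e9) # Gbps -> bits per second
    return seconds / 3600.0

# A 12 x 2TB node that dies with full disks, at ~1 Gbps of spare bandwidth:
print(round(rereplication_hours(24, 1.0), 1))   # ~53.3 hours
# The "50TB machine" from point 2:
print(round(rereplication_hours(50, 1.0), 1))   # ~111.1 hours
```

The rate you can actually sustain depends on how much IO and network you are willing to steal from running jobs, which is exactly the trade-off discussed below.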
>> /Ian
>> Architect / Mgr - Novell Vibe
>> On 05/02/2011 09:57 AM, Michael Segel wrote:
>>> Hi,
>>> That's actually a really good question.
>>> Unfortunately, the answer isn't really simple.
>>> You're going to need to estimate your growth, and you're going to need to estimate
>>> your configuration.
>>> Suppose I know that within 2 years the amount of data I want to retain is going
>>> to be 1PB; with a 3x replication factor, I'll need at least 3PB of disk. Assuming
>>> I can fit 12x2TB drives in a node, I'll need 125-150 machines. (There's some
>>> overhead for logging and the OS.)
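The arithmetic above can be checked directly. A sketch (the ~15% reserve for OS and logging is an assumed figure standing in for the "some overhead" mentioned):

```python
# Cluster sizing sketch from the numbers in the thread.
retained_pb = 1.0                  # data to retain after ~2 years
replication = 3                    # HDFS replication factor
node_tb = 12 * 2                   # 12 x 2TB drives per node = 24 TB raw

raw_tb = retained_pb * replication * 1000     # 3 PB = 3000 TB of raw disk
nodes_min = raw_tb / node_tb                  # 125 nodes with disks 100% full
nodes_planned = nodes_min / 0.85              # reserve ~15% for OS and logging
print(int(nodes_min), round(nodes_planned))   # 125 147 -- i.e. the 125-150 range
```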
>>> Now this doesn't mean that I'll need to buy all of the machines today and build
>>> out the cluster.
>>> It means that I will need to figure out my machine room (rack space, power,
>>> etc.) and also my hardware configuration.
>>> You'll also need to plan out your hardware choices. An example: you may want
>>> 10GbE on the switch but not at the data node. However, you're going to want to
>>> be able to expand your data nodes later by adding 10GbE cards.
>>> The idea is that as I build out my cluster, all of the machines have the same
>>> look and feel. So if you buy quad-core 2.2 GHz CPUs now and 2.6 GHz CPUs six
>>> months from now, as long as they are 4-core CPUs, your cluster will look the
>>> same.
>>> The point is that when you lay out your cluster to start with, you'll need to
>>> plan ahead and keep things similar. Also, you'll need to make sure your NameNode
>>> has enough memory.
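The NameNode sizing mentioned here can also be estimated up front. A sketch using the common rule of thumb of roughly 1 GB of NameNode heap per million HDFS objects; the block size, blocks-per-file ratio, and the rule of thumb itself are all assumptions, not figures from the thread:

```python
# Rough NameNode heap estimate for the 1PB scenario above.
data_tb = 1000.0                         # 1 PB of retained (logical) data
block_mb = 64                            # 2011-era default HDFS block size
blocks = data_tb * 1e6 / block_mb        # ~15.6 million blocks
files = blocks / 10                      # assume ~10 blocks per file on average
heap_gb = (blocks + files) / 1e6         # ~1 GB of heap per 1M objects
print(round(heap_gb, 1))                 # ~17.2 GB, before growth headroom
```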
>>> Having said that... Yahoo! has written a paper detailing MR2 (the next
>>> generation of MapReduce). As the M/R job scheduler becomes more intelligent
>>> about the types of jobs and types of hardware, the consistency of hardware
>>> becomes less important.
>>> With respect to HBase, I suspect there to be a parallel evolution.
>>> As to building out and replacing your cluster... if this is a production
>>> environment, you'll have to think about DR and building out a second cluster.
>>> So the cost of replacing clusters should also be factored in when you budget
>>> for hardware.
>>> Like I said, it's not a simple answer; you have to approach each instance
>>> separately and fine-tune your cluster plans.
>>> HTH
>>> -Mike
>>> ----------------------------------------
>>>> Date: Mon, 2 May 2011 09:53:05 +0300
>>>> From: iulia.zidaru@1and1.ro
>>>> To: user@hbase.apache.org
>>>> CC: stack@duboce.net
>>>> Subject: Re: Hardware configuration
>>>> Thank you both. How would you estimate really big clusters, with
>>>> hundreds of nodes? Requirements might change over time, and replacing an
>>>> entire cluster doesn't seem like the best solution...
>>>> On 04/29/2011 07:08 PM, Stack wrote:
>>>>> I agree with Michael Segel. Distributed computing is hard enough.
>>>>> There is no need to add extra complexity.
>>>>> St.Ack
>>>>> On Fri, Apr 29, 2011 at 4:05 AM, Iulia Zidaru wrote:
>>>>>> Hi,
>>>>>> I'm wondering if having a cluster with different machines in terms of CPU,
>>>>>> RAM and disk space would be a big issue for HBase. For example, machines
>>>>>> with 12GB of RAM and machines with 48GB. We suppose that we use them at
>>>>>> full capacity. What problems might we encounter with this kind of
>>>>>> configuration?
>>>>>> Thank you,
>>>>>> Iulia
>>>> --
>>>> Iulia Zidaru
>>>> Java Developer
>>>> 1&1 Internet AG - Bucharest/Romania - Web Components Romania
>>>> 18 Mircea Eliade St
>>>> Sect 1, Bucharest
>>>> RO Bucharest, 012015
>>>> iulia.zidaru@1and1.ro
>>>> 0040 31 223 9153
