hbase-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Thousands of tables
Date Fri, 30 Jul 2010 17:16:47 GMT
On Fri, Jul 30, 2010 at 12:41 PM, Jean-Daniel Cryans
<jdcryans@apache.org> wrote:
>> I see. Usually a whole customer fits within a region. Actually, only two
>> or three customers don't fit in a single region.
>> But then another question comes up. Even if I put all the data in a single
>> table, given that the keys are written in order, and given that several
>> customers can fit in the same region, I'd have the exact same problem, right?
>> I mean, if data from customers A to D sits in the same region within the same
>> table, the result is worse than having 4 different tables, as those can actually
>> sit on other region servers, right?
>> Is there a way to move a region manually to another machine?
> If you expect that some contiguous rows would be really overused, then
> change the row key. UUIDs, for example, would spread them all over the
> regions.
> In 0.20 you can do a close_region in the shell; that will move the
> region to the first region server that checks in. In 0.90 we are working
> on better load balancing, more properly tuned to region traffic.
>>> Client side? I don't believe so, there's almost nothing kept in memory.
>> Even if all the htables are opened at the same time?
> The only connections kept are with region servers, not with "tables",
> so even if you have 1k of them it's just 999 more objects to keep in
> memory (compared to the single-table design). If you are afraid that
> it would be too much, you can shard the web servers per client. In
> any case, why not test it?
> J-D
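
J-D's row-key suggestion above (spread otherwise-contiguous rows by changing the key) can be sketched as follows. The class and method names here are illustrative, not part of the HBase API; a fixed-width hash-derived "salt" prefix is one common way to get the same spreading effect as UUIDs while keeping keys deterministic:

```java
// Hypothetical sketch: prefix each sequential row key with a small
// hash-derived "salt" so otherwise-contiguous rows land in different regions.
// Nothing here is HBase API; the names are illustrative.
public final class SaltedKeys {

    // Number of distinct salt prefixes; more prefixes means a wider spread.
    static final int SALT_BUCKETS = 16;

    // Build a salted row key of the form "<salt>-<originalKey>".
    // The salt is derived deterministically from the key itself, so a
    // reader can reconstruct the salted key for a point lookup.
    static String saltedKey(String originalKey) {
        int salt = Math.floorMod(originalKey.hashCode(), SALT_BUCKETS);
        return String.format("%02d-%s", salt, originalKey);
    }
}
```

The trade-off is the usual one: salting breaks single-range scans over the original key order, so a full scan becomes SALT_BUCKETS parallel scans.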

Usually, the people who went with the "table per customer" approach in the
RDBMS world did so because their database did not offer built-in
partitioning, or because they wanted to offer quality-of-service features
such as "high-paying customers go to new, fancy servers".

I feel this approach somewhat contradicts the scaling model.
It also introduces the issue of managing schema changes across X
instances. With cloud serving systems the schema is typically
simpler, but replicating changes X times could still be an issue.

There exists a hybrid approach, which I borrow from Hive bucketing.
Rather than making one partition for each customer, bucket the
customers by calculating hash(id) mod 64. Customers are distributed
randomly across the 64 buckets, and by randomness small customers and
large customers balance out.
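
The bucketing idea above can be sketched like this. The names (CustomerBuckets, bucketFor, rowKey) and the key layout are assumptions for illustration; the one fixed point from the text is hash(id) mod 64:

```java
// Sketch of the Hive-style bucketing idea from the text: instead of one
// table per customer, route each customer to one of 64 shared buckets.
// Class, method, and key-layout choices here are hypothetical.
public final class CustomerBuckets {

    // Fixed bucket count, as in the "hash id mod 64" scheme described above.
    static final int NUM_BUCKETS = 64;

    // Deterministically assign a customer to a bucket in [0, 64).
    static int bucketFor(String customerId) {
        return Math.floorMod(customerId.hashCode(), NUM_BUCKETS);
    }

    // Row key within the single shared table:
    // "<bucket>:<customerId>:<rowSuffix>". The bucket prefix spreads
    // customers across regions; the customerId keeps each customer's
    // rows contiguous within its bucket, so per-customer scans still work.
    static String rowKey(String customerId, String rowSuffix) {
        return String.format("%02d:%s:%s",
                bucketFor(customerId), customerId, rowSuffix);
    }
}
```

Because the bucket is a pure function of the customer id, both readers and writers compute it independently with no lookup table to maintain.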

Still, I do not like "table per customer", or even the bucket idea I
introduced, for NoSQL. I see it causing X times the pressure on the
NameNode, I see it causing X times the work in monitoring, and your
application servers will now be caching X HTable connections (which does
not seem like a good idea).
"Early optimization is the root of much evil"
