hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Hive+HBase performance is much poorer than Hive+HDFS
Date Thu, 13 Oct 2011 17:25:34 GMT
Your question is more basic than that: how much slower is it to read
sequentially from HBase than from HDFS? I'm not sure anyone has
quantified that, and there are probably a bunch of factors that can influence
it, but at least you should try to get the same level of distribution. For
example, since you have fewer regions than mapper slots, force-split that
table once or twice to get more of them. The difference in the number of map
tasks comes from the fact that regions can grow to 256MB by default before
splitting, whereas in HDFS the default block size is 64MB.
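
As a rough sketch of the distribution point above (not a measurement), the following Python snippet estimates how forced splits change the number of map "waves" on the cluster described in the quoted thread. It assumes one map task per region, that every split round exactly doubles the region count, and that all tasks over equal-sized regions take the same time; none of these hold exactly in practice, and the function names are illustrative only.

```python
import math

def map_waves(num_tasks, num_slots):
    """Sequential rounds needed to run num_tasks on num_slots map slots."""
    return math.ceil(num_tasks / num_slots)

def slot_utilization(num_tasks, num_slots):
    """Fraction of slot-time kept busy, assuming equal-length tasks."""
    return num_tasks / (map_waves(num_tasks, num_slots) * num_slots)

def relative_wall_time(regions, num_slots, splits):
    """Map-phase wall time relative to the unsplit table.

    Assumption: each split round doubles the regions and halves the
    work (and therefore the duration) of every map task.
    """
    tasks = regions * 2 ** splits
    return map_waves(tasks, num_slots) / 2 ** splits

MAP_SLOTS = 11 * 4  # 11 task trackers x 4 map slots, from the thread
REGIONS = 32        # region count reported in the thread

for splits in range(3):
    tasks = REGIONS * 2 ** splits
    print(
        f"splits={splits} tasks={tasks} "
        f"waves={map_waves(tasks, MAP_SLOTS)} "
        f"utilization={slot_utilization(tasks, MAP_SLOTS):.2f} "
        f"relative_time={relative_wall_time(REGIONS, MAP_SLOTS, splits):.2f}"
    )
```

Under this simplified model, better distribution helps but only modestly, so it cannot account for the whole gap on its own.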

Then maybe your HBase schema isn't efficient (fat keys), but I wouldn't be
able to tell just by what you wrote.

In any case, since you have to go through an additional layer, it will
definitely be slower to use HBase than directly reading the files.

J-D

On Thu, Oct 13, 2011 at 1:53 AM, Weihua JIANG <weihua.jiang@gmail.com> wrote:

> After setting this argument to 1000, I get this result: hive/hbase is 4X
> slower than hive/hdfs.
>
> What slowdown factor should be expected for hive/hbase vs hive/hdfs?
>
> Thanks
> Weihua
>
> 2011/10/12 Akash Ashok <thehellmaker@gmail.com>:
> > Hi,
> > To set this parameter you could use
> > "set hbase.client.scanner.caching=500;"
> > before the execution of your hive query.
> >
> > Cheers,
> > Akash
> >
> > On Wed, Oct 12, 2011 at 8:34 AM, Weihua JIANG <weihua.jiang@gmail.com> wrote:
> >
> >> Since I am using Hive to run the query, I don't know how to set it.
> >> Can you tell me how to do so?
> >>
> >> Thanks
> >> Weihua
> >>
> >> 2011/10/12 Jean-Daniel Cryans <jdcryans@apache.org>:
> >> > This is one big factor and you didn't mention configuring it:
> >> > http://hbase.apache.org/book.html#perf.hbase.client.caching
> >> >
> >> > J-D
> >> >
> >> > On Tue, Oct 11, 2011 at 7:47 PM, Weihua JIANG <weihua.jiang@gmail.com> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> I have run some perf tests on Hive+HBase. The table is a normal 2D
> >> >> table with about 160M rows (each row with 7 small columns) and 32
> >> >> regions. There is only one column family, and all regions were
> >> >> major compacted to one store file before the test.
> >> >>
> >> >> On a cluster with 11 task trackers (each with 4 map slots and 1
> >> >> reduce slot; these servers also act as region servers), a simple SQL
> >> >> query in Hive
> >> >>   select count(*) from table where column3='Y';
> >> >> needs ~1700 seconds to finish.
> >> >>
> >> >> But after using a CTAS statement to create an internal table (stored
> >> >> as a sequence file), the same statement needs only 43 seconds to finish.
> >> >>
> >> >> So Hive+HBase is 40X slower than Hive+HDFS.
> >> >>
> >> >> Hive+HBase does have fewer map tasks (32 vs 223), but since there are
> >> >> only 44 map slots available, I don't think that is the main cause.
> >> >>
> >> >> I studied the source code of the HBase scan implementation. It seems
> >> >> to me that, in my case, the scan reads the HFile in much the same way
> >> >> as a sequence file is read (sequential reading of each key/value
> >> >> pair). So, in theory, the performance should be quite similar.
> >> >>
> >> >> Can anyone explain the 40X slowdown?
> >> >>
> >> >> Thanks
> >> >> Weihua
> >> >>
> >> >
> >>
> >
>
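
To illustrate why the scanner caching setting discussed in the quoted thread matters so much, here is a back-of-the-envelope Python sketch that counts the client-to-region-server round trips needed to scan the 160M-row table at different hbase.client.scanner.caching values. It assumes each scanner "next" call returns exactly `caching` rows and ignores the extra calls at region boundaries. The value 1 is included on the assumption that it was the client default in HBase releases of that era; check the documentation for your version. The function name is illustrative only.

```python
import math

def scanner_rpcs(total_rows, caching):
    """Approximate number of scanner next() round trips for a full scan.

    Assumption: every round trip returns exactly `caching` rows, except
    possibly the last one; per-region open/close calls are ignored.
    """
    return math.ceil(total_rows / caching)

TOTAL_ROWS = 160_000_000  # table size reported in the thread

for caching in (1, 500, 1000):
    print(f"caching={caching:>4} -> about {scanner_rpcs(TOTAL_ROWS, caching):,} round trips")
```

With one row per round trip, per-call overhead can dominate the scan, which fits the drop from roughly 40X to 4X reported after the setting was raised to 1000.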
