hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Hive+HBase performance is much poorer than Hive+HDFS
Date Wed, 12 Oct 2011 02:53:46 GMT
This is one big factor and you didn't mention configuring it:


On Tue, Oct 11, 2011 at 7:47 PM, Weihua JIANG <weihua.jiang@gmail.com> wrote:

> Hi all,
> I have run some performance tests on Hive+HBase. The table is a normal 2D
> table with about 160M rows (each row with 7 small columns) and 32
> regions. There is only one column family, and all regions were
> major compacted to one store file before the test.
> On a cluster with 11 task trackers (each with 4 map slots and 1 reduce
> slot; these servers also act as region servers), a simple SQL query in Hive
>   select count(*) from table where column3='Y';
> needs ~1700 seconds to finish.
> But after using a CTAS statement to create an internal table (stored as
> a sequence file), the same statement needs only 43 seconds to finish.
> So Hive+HBase is 40X slower than Hive+HDFS.
> Hive+HBase does run fewer map tasks (32 vs 223), but since there are
> only 44 map slots available, I don't think that is the main cause.
> I studied the source code of the HBase scan implementation. It seems
> to me that, in my case, the scan reads an HFile in much the same way
> as a sequence file is read (sequentially, one key/value pair at a time).
> So, in theory, the performance should be quite similar.
> Can anyone explain the 40X slowdown?
> Thanks
> Weihua
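
A rough back-of-the-envelope sketch of the figures quoted above may help put the gap in per-task terms. It assumes rows are spread evenly across map tasks, that tasks run in full waves over the 44 map slots, and that job startup and reduce time are negligible. None of this is stated in the thread, so the results are only indicative.

```python
import math

# Figures taken from the quoted message.
ROWS = 160_000_000          # about 160M rows
MAP_SLOTS = 11 * 4          # 11 task trackers x 4 map slots = 44

def per_task_scan_rate(total_seconds, map_tasks):
    """Estimate waves of map tasks and rows/second scanned by one task.

    Assumption (not from the thread): even row distribution, full waves,
    and no time spent outside the map phase.
    """
    waves = math.ceil(map_tasks / MAP_SLOTS)
    seconds_per_task = total_seconds / waves
    rows_per_task = ROWS / map_tasks
    return waves, rows_per_task / seconds_per_task

hbase_waves, hbase_rate = per_task_scan_rate(1700, 32)   # Hive+HBase
hdfs_waves, hdfs_rate = per_task_scan_rate(43, 223)      # Hive+HDFS (sequence file)
slowdown = 1700 / 43

print(f"Hive+HBase: {hbase_waves} wave(s), ~{hbase_rate:,.0f} rows/s per task")
print(f"Hive+HDFS:  {hdfs_waves} wave(s), ~{hdfs_rate:,.0f} rows/s per task")
print(f"Overall slowdown: ~{slowdown:.1f}x")
```

Under these assumptions the HBase job fits in a single wave that uses only 32 of the 44 slots, so the slowdown shows up almost entirely as a much lower per-task scan rate rather than as a shortage of slots.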
