hbase-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject Hive + Hbase scanning performance
Date Mon, 10 Feb 2014 18:56:20 GMT
I know this has been asked before. I googled around this topic and tried to understand
as much as possible, but I got different answers in different places. So I would like to
describe what I have run into and ask for help again on this topic.
I created one table with one column family and 20+ columns in Hive. It is populated with
around 150M records from a 20 GB CSV file. What I want to check is how fast a full scan of
the HBase table can go in an MR job.
It is running on a 10-node Hadoop cluster (Hadoop 1.1.1 + HBase 0.94.3 + Hive 0.9):
8 of them are Data + Task nodes, one is the NN and HBase master, and another one runs the
2nd NN.
4 of the 8 data nodes also run HBase region servers.
I used the following code example to get a row count from an MR job: http://hbase.apache.org/book/mapreduce.example.html
At first, the mapper tasks ran very slowly, because I had commented out the following 2 lines on purpose:
scan.setCaching(1000);        // 1 is the default in Scan, which will be bad for MapReduce
scan.setCacheBlocks(false);  // don't set to true for MR jobs
Then I added the above 2 lines back, and the job ran almost 10x faster than the first run. That's
good; it proved to me that the above 2 lines are important for an HBase full scan.
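For context, the HBase book example referred to above boils down to a driver roughly like the sketch below. This is not runnable standalone (it needs a live HBase 0.94 cluster and the HBase jars on the classpath), and the class and table names are placeholders; the two highlighted lines are the ones discussed above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountJob {

  public static enum Counters { ROWS }

  static class RowCountMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value,
                       Context context) {
      // Each map() call sees one row; just bump a counter.
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "rowcount");   // pre-Hadoop-2 Job constructor
    job.setJarByClass(RowCountJob.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // the line that made the ~10x difference
    scan.setCacheBlocks(false);  // don't pollute the block cache during MR scans

    TableMapReduceUtil.initTableMapperJob(
        "my_table",              // placeholder table name
        scan,
        RowCountMapper.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);    // map-only; the counter is the result
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The key point is that caching is set on the Scan object itself, which the MR job serializes into the job configuration; this is a different code path from the generic hbase.client.scanner.caching property discussed below.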
Now the question moves to Hive.
I had already created a table in Hive linked to the HBase table, then I started my hive
session like this:
hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar
-hiveconf hbase.master=Hbase_master:port
If I run the query "select count(*) from table", I can see that the mapper performance is very
bad, almost as bad as my 1st run above.
I searched this mailing list, and it looks like there is a setting in the Hive session to change the
scan caching size, the same as the 1st line of the code above.
So I added the following setting in my hive session:
set hbase.client.scanner.caching=1000;
To my surprise, after this setting in the hive session, the new MR job generated by the Hive
query was still very slow, the same as before the setting.
Here is what I found so far:
1) In my own MR code, both before and after I added the 2 lines of code, the job.xml of the
MR job contained this setting: hbase.client.scanner.caching=1. So this setting was the same
in both runs, but the performance improved greatly after the code change.
2) In the Hive run, I saw the setting "hbase.client.scanner.caching" change from 1 to 1000 in
job.xml, which is what I set in the hive session, but the performance did not change much.
So the setting was changed, but it didn't help the performance as I expected.
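The job.xml comparison described above can be done with a simple grep. The snippet below uses a minimal stand-in file to show the shape of the check; on a real Hadoop 1.x cluster the job.xml lives under the JobTracker's staging/history directories, and the path would differ per install.

```shell
# Minimal stand-in for a submitted job's job.xml (illustrative only;
# a real one has hundreds of properties).
cat > job.xml <<'EOF'
<configuration>
  <property><name>hbase.client.scanner.caching</name><value>1000</value></property>
</configuration>
EOF

# Pull out the effective scanner-caching property to confirm what
# value the job was actually submitted with.
grep 'hbase.client.scanner.caching' job.xml
```

If the property shows 1000 here but the mappers are still slow, the value is reaching the job configuration but evidently not the Scan that the Hive HBase handler builds.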
My questions are the following:
1) Is there any setting in Hive (0.9) that does the same as the 1st line of the code change? From
Google and the HBase documentation, it looks like the above configuration is the one, but it didn't
help me.
2) Even assuming the above setting is correct, why do we have this Hive JIRA to fix the
HBase scan cache, marked as fixed ONLY in Hive 0.12? The JIRA ticket is here: https://issues.apache.org/jira/browse/HIVE-3603
3) Is there any Hive setting that can do the same as the 2nd line of the code change above? If so, what is it?
I googled around and cannot find one.
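One small sanity check worth doing first: the Hive CLI echoes a property's current session value when you issue set with the name and no value (a generic Hive CLI feature, not specific to the HBase handler). Something like the following confirms the session holds the value, though, as question 1 above suggests, it does not confirm that the HBase handler in Hive 0.9 actually applies it to its Scan:

hive> set hbase.client.scanner.caching=1000;
hive> set hbase.client.scanner.caching;
hbase.client.scanner.caching=1000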