hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Hive + Hbase scanning performance
Date Mon, 10 Feb 2014 19:24:51 GMT
You can patch HIVE-3603 into your deployment so that you can make use of


On Mon, Feb 10, 2014 at 10:56 AM, java8964 <java8964@hotmail.com> wrote:

> Hi,
> I know this has been asked before. I googled this topic and tried to
> understand as much as possible, but I got different answers in different
> places. So I would like to describe what I have run into and ask whether
> someone can help me with this topic again.
> I created one table with one column family and 20+ columns in Hive.
> It is populated with around 150M records from a 20G CSV file. What I want
> to check is how fast a full scan of the HBase table can run in an MR job.
> It is running on a 10-node Hadoop cluster (Hadoop 1.1.1 + HBase 0.94.3 +
> Hive 0.9): 8 nodes are data + task nodes, one is the NN and HBase master,
> and another one runs the 2nd NN.
> 4 of the 8 data nodes also run HBase region servers.
> I used the following code example to get a row count from an MR job:
> http://hbase.apache.org/book/mapreduce.example.html
> At first the mapper tasks ran very slowly, because I had commented out the
> following 2 lines on purpose:
> scan.setCaching(1000);        // 1 is the default in Scan, which will be bad for MapReduce jobs
> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> Then I added the above 2 lines back, and the job ran almost 10X faster than
> the first run. That's good; it proved to me that these 2 lines are
> important for an HBase full scan.
> Now the question comes to Hive.
> I had already created the table in Hive, linked to the HBase table, and I
> started my Hive session like this:
> hive --auxpath
> $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar
> -hiveconf hbase.master=Hbase_master:port
> If I run the query "select count(*) from table", I can see that the
> mappers' performance is very bad, almost as bad as my 1st run above.
> I searched this mailing list, and it looks like there is a setting in the
> Hive session to change the scan caching size, the same as the 1st line of
> the code above, from here:
> http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spWf4qMUvyw@mail.gmail.com%3E
> So I added the following setting in my Hive session:
> set hbase.client.scanner.caching=1000;
> To my surprise, after this setting in the Hive session, the new MR job
> generated from the Hive query was still very slow, the same as before the
> setting.
> Here is what I have found so far:
> 1) In my own MR code, both before and after adding the 2 lines of code, I
> saw this setting in the job.xml of the MR job:
> hbase.client.scanner.caching=1
> So this setting is the same in both runs, but the performance improved
> greatly after the code change.
> 2) In the Hive run, I saw the setting "hbase.client.scanner.caching" change
> from 1 to 1000 in job.xml, which is what I set in the Hive session, but the
> performance did not change much. So the setting was changed, but it
> didn't help the performance as I expected.
> My questions are the following:
> 1) Is there any setting in Hive (0.9) that does the same as the 1st line of
> the code change? From Google and the HBase documentation, it looks like the
> above configuration is the one, but it didn't help me.
> 2) Even assuming the above setting is correct, why do we have this Hive
> Jira to fix the HBase scan cache, marked as fixed ONLY in Hive 0.12? The
> Jira ticket is here:
> https://issues.apache.org/jira/browse/HIVE-3603
> 3) Is there any Hive setting that can do the same as the 2nd line of the
> code change above? If so, what is it? I googled around and cannot find one.
> Thanks
> Yong
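
A rough way to see why the scan.setCaching(1000) line in the quoted message matters so much: scanner caching is the number of rows the client asks a region server for per scanner call, so it bounds how many round trips a full scan needs. The sketch below is only back-of-the-envelope arithmetic, not HBase code. The class and method names are made up for illustration, the 150M row count is the approximate figure from the message, and the model ignores region boundaries, result size limits, and per-call latency differences.

```java
public class ScanRoundTrips {

    /**
     * Estimates how many scanner calls are needed to stream rowCount rows
     * when each call returns at most `caching` rows (ceiling division).
     * Hypothetical helper for illustration; not part of the HBase API.
     */
    static long roundTrips(long rowCount, int caching) {
        if (rowCount < 0 || caching < 1) {
            throw new IllegalArgumentException("rowCount must be >= 0 and caching >= 1");
        }
        return (rowCount + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 150_000_000L; // approximate table size quoted in the message
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.printf("caching=%d -> about %d scanner calls%n",
                    caching, roundTrips(rows, caching));
        }
    }
}
```

Under these assumptions, moving from a caching value of 1 to 1000 cuts the estimated number of calls by three orders of magnitude, which is consistent in direction with the large speedup reported for the hand-written MR job.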
