hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Whiting <je...@qualtrics.com>
Subject Re: Speeding up Scans
Date Wed, 25 Jan 2012 18:09:09 GMT
Does it make sense to have better defaults so the performance out of the box is better?

~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha!  I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
>         scan.setCacheBlocks(true);
>         scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
>
>
>
> On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check:  what caching level are you using?  (default is 1)  I
>> know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use the
>> filter?
>>
>> As for EC2, that's a wildcard.
>>
>>
>>
>>
>>
>> On 1/25/12 7:56 AM, "Peter Wolf"<opus111@gmail.com>  wrote:
>>
>>> Hello all,
>>>
>>> I am looking for advice on speeding up my Scanning.
>>>
>>> I want to iterate over all rows where a particular column (language)
>>> equals a particular value ("JA").
>>>
>>> I am already creating my row keys using that column in the first bytes.
>>> And I do my scans using partial row matching, like this...
>>>
>>>      public static byte[] calculateStartRowKey(String language) {
>>>          int languageHash = language.length()>  0 ? language.hashCode() :
>>> 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      public static byte[] calculateEndRowKey(String language) {
>>>          int languageHash = language.length()>  0 ? language.hashCode() :
>>> 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>> calculateEndRowKey(language));
>>>
>>>
>>> Since I am using a hash value for the string, I need to re-check the
>>> column to make sure that some other string does not get the same hash
>>> value
>>>
>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>> languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>      scan.setFilter(filter);
>>>
>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>> EC2.
>>>
>>> I think that this should be really fast, but it is not.  Any advice on
>>> how to debug/speed it up?
>>>
>>> Thanks
>>> Peter
>>>
>>>
>>>
>>>
>>>
>>
>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com


Mime
View raw message