hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Fast retrieval of multiple rows with non-sequential keys
Date Mon, 05 Oct 2009 22:05:54 GMT
Another thing to consider is restructuring your keys to create data
locality. One of the advantages of HBase is that you can create that data
locality very easily just by changing row keys.
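
For example (purely illustrative; the <groupId>/<docId> key layout and the
class below are made up for the sketch), if documents that get fetched
together share something like a site or collection id, putting that id at
the front of the row key lets one short scan replace many scattered gets:

  import java.io.IOException;

  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GroupScan {
    // Row keys laid out as <groupId>/<docId>, so rows that are read together
    // sit next to each other and one scan covers the whole group.
    public static void readGroup(HTable table, String groupId) throws IOException {
      Scan scan = new Scan(Bytes.toBytes(groupId + "/"),
                           Bytes.toBytes(groupId + "0")); // '0' sorts right after '/'
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // use r.getRow(), r.getValue(family, qualifier), ...
        }
      } finally {
        scanner.close();
      }
    }
  }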

In the end, if you are doing lots of random reads, we are talking disk
seeks. No data storage system provides fast access for that pattern.
Heavy caching is what most people end up doing.

SSDs are becoming more of an option as prices come down.

good luck!

On Mon, Oct 5, 2009 at 7:06 AM, Jochen Frey <jochen_frey@yahoo.com> wrote:
> Thanks JG.
>
> I'll check out JIRA and educate myself.
>
> If I had my wish - I'd get the results streamed back to me, so that I can
> start work on the results while they're being retrieved.
>
> :-)
>
> J
>
> On Oct 5, 2009, at 3:36 PM, Jonathan Gray wrote:
>
>> This is being worked on.  Ideally, a solution would batch the Gets by region
>> and then by regionserver, so that the total number of RPC calls would be,
>> at most, the number of servers.
>>
>> Follow HBASE-1845 and related issues.
>>
>> For now, you can use threads in your application to run the multiple gets
>> in parallel.
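>>
>> A rough, untested sketch of that stopgap (assuming the 0.20 client API;
>> the class and method names below are just for illustration, and each task
>> opens its own HTable since HTable is not thread-safe):
>>
>>   import java.util.ArrayList;
>>   import java.util.List;
>>   import java.util.concurrent.Callable;
>>   import java.util.concurrent.ExecutorService;
>>   import java.util.concurrent.Executors;
>>   import java.util.concurrent.Future;
>>
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.client.Get;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Result;
>>
>>   public class ParallelGets {
>>     // Fan the Gets out over a thread pool; still one RPC per row, but
>>     // the calls overlap instead of running back to back.
>>     public static List<Result> get(final HBaseConfiguration conf,
>>         final String tableName, List<Get> gets, int threads) throws Exception {
>>       ExecutorService pool = Executors.newFixedThreadPool(threads);
>>       try {
>>         List<Future<Result>> futures = new ArrayList<Future<Result>>();
>>         for (final Get g : gets) {
>>           futures.add(pool.submit(new Callable<Result>() {
>>             public Result call() throws Exception {
>>               // Fresh HTable per task; reuse or pool these in real code.
>>               HTable table = new HTable(conf, tableName);
>>               return table.get(g);
>>             }
>>           }));
>>         }
>>         List<Result> results = new ArrayList<Result>(gets.size());
>>         for (Future<Result> f : futures) {
>>           results.add(f.get()); // keeps results in the same order as the Gets
>>         }
>>         return results;
>>       } finally {
>>         pool.shutdown();
>>       }
>>     }
>>   }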
>>
>> JG
>>
>> On Mon, October 5, 2009 3:02 am, Jochen Frey wrote:
>>>
>>> I want to use HBase as a BLOB store for a search engine application.
>>> The objects will be stored in one HBase table (~1B rows), with object
>>> sizes typically between 1kB and 20kB.
>>>
>>>
>>> I am concerned about my read pattern, where a typical read retrieves
>>> between tens and thousands of rows in random order. Looking at the Java
>>> API, the only way to retrieve rows in random order is to issue multiple
>>>
>>> Result = HTable.get(Get)
>>>
>>>
>>> requests sequentially (I assume a Scanner is not a good idea since the
>>> rows I need are spread randomly across the table / regions / etc.).
>>>
>>> My concern is that with that pattern I have one rpc call per item,
>>> which seems to be a lot of overhead, especially when I need to retrieve
>>> 100s or 1,000s of rows.
>>>
>>>
>>> Would it not be preferable to batch up requests so that all requested
>>> rows would be grouped by region and then sent off to the regions in
>>> parallel for retrieval? That way there'd be fewer RPC calls, and they
>>> could be executed in parallel as well. Such an addition to the
>>> interface could look something like
>>>
>>> List<Result> = HTable.get(List<Get>)
>>>
>>>
>>> Am I making sense? Is there something that I am missing?
>>>
>>>
>>> Thanks!
>>> Jochen
>>>
>>>
>>>
>>>
>>
>
> ---
> m: jochen_frey@yahoo.com
> p: +1.415.706.1341
>
>
