hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Optimizing Multi Gets in hbase
Date Tue, 19 Feb 2013 06:45:17 GMT
I am actually more concerned about multiple gets within a region. I think
if random rows within a region are accessed, it should always be one scan
instead of doing one scan per get (just like we do for the
BulkDeleteEndpoint). Wouldn't that always be faster ?
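
To make the idea concrete, here is a minimal plain-Java sketch, not HBase code. A `TreeMap` stands in for the sorted rows of one region, and the class and method names (`SingleScanMultiGet`, `scanForKeys`) are hypothetical. It answers a whole batch of lookups with one ordered pass between the smallest and largest requested key, instead of one lookup per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class SingleScanMultiGet {

    // Hypothetical helper: serve many point lookups with a single ordered pass.
    // "region" is only a stand-in for a region's sorted rows (assumption).
    static List<String> scanForKeys(NavigableMap<String, String> region,
                                    SortedSet<String> wanted) {
        List<String> found = new ArrayList<>();
        if (wanted.isEmpty()) {
            return found;
        }
        // Bound the pass to [smallest wanted key, largest wanted key].
        NavigableMap<String, String> range =
                region.subMap(wanted.first(), true, wanted.last(), true);
        for (Map.Entry<String, String> row : range.entrySet()) {
            // Every row in the range is visited; only requested rows are kept.
            if (wanted.contains(row.getKey())) {
                found.add(row.getValue());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> region = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            region.put("row" + i, "value" + i);
        }
        SortedSet<String> wanted =
                new TreeSet<>(Arrays.asList("row7", "row2", "row5", "missing"));
        System.out.println(scanForKeys(region, wanted)); // [value2, value5, value7]
    }
}
```

Note that the single pass visits every row between the first and last requested key, so in this sketch it only pays off when the requested keys are dense within that range.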

On Mon, Feb 18, 2013 at 5:48 PM, lars hofhansl <larsh@apache.org> wrote:

> As it happens we did some tests around this last week.
> Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> the performance.
> I.e. when you have a table with, say, 10m rows and scanning takes N
> seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is pretty
> impressive.
> Now, this is with all data in the cache!
> When the data is not in the cache and the Gets are random it is many
> orders of magnitude slower, as the Gets are sprayed all over the disk. In
> that case sorting the Gets and issuing scans would indeed be much more
> efficient.
> The Gets in a batch are already sorted on the client, but as N. says it is
> hard to determine when to turn many Gets into a Scan with filters
> automatically. Without statistics/histograms I'd even wager a guess that it
> would be impossible to do.
> Imagine you issue 10000 random Gets, but your table has 10bn rows, in that
> case it is almost certain that the Gets are faster than a scan.
> Now imagine the Gets only cover a small key range. With statistics we could
> tell whether it would be beneficial to turn this into a scan.
> It's not that hard to add statistics to HBase. Would do it as part of the
> compactions, and record the histograms in some table.
> You can always do that yourself. If you suspect you are touching most rows
> in a table/region, just issue a scan with an appropriate filter (may have to
> implement your own filter, though). Maybe we could add a version of RowFilter
> that matches against multiple keys.
> -- Lars
> ________________________________
>  From: Varun Sharma <varun@pinterest.com>
> To: user@hbase.apache.org
> Sent: Monday, February 18, 2013 1:57 AM
> Subject: Optimizing Multi Gets in hbase
> Hi,
> I am trying to do batched get(s) on a cluster. Here is the code:
> List<Get> gets = ...
> // Prepare my gets with the rows I need
> myHTable.get(gets);
> I have two questions about the above scenario:
> i) Is this the most optimal way to do this ?
> ii) I have a feeling that if there are multiple gets in this case, on the
> same region, then each one of those shall instantiate separate scan(s) over
> the region even though a single scan is sufficient. Am I mistaken here ?
> Thanks
> Varun
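
The gets-versus-scan trade-off discussed in this thread can be put into rough numbers. The sketch below is only an illustration under stated assumptions: it takes the cached measurement quoted above (10m batched Gets take about 3N where a 10m-row scan takes N) as a per-row cost ratio of 3, and it assumes some statistic supplies an estimate of how many rows lie in the key range spanned by the Gets. The class, constant, and method names are hypothetical, and the ratio would be very different when the data is not in the cache.

```java
class GetsVsScan {

    // Assumption: one batched Get costs about 3x the per-row cost of a scan,
    // taken from the all-data-in-cache measurement quoted in the thread.
    static final double GET_TO_SCAN_ROW_COST = 3.0;

    // Hypothetical heuristic: a scan reads every row in the spanned key range,
    // so prefer it only when that is cheaper than issuing the Gets one by one.
    static boolean preferScan(long numGets, long estimatedRowsInRange) {
        return estimatedRowsInRange < numGets * GET_TO_SCAN_ROW_COST;
    }

    public static void main(String[] args) {
        // 10,000 random Gets against a 10bn-row table: Gets win.
        System.out.println(preferScan(10_000L, 10_000_000_000L)); // false
        // 10,000 Gets covering a range of only 20,000 rows: a scan wins.
        System.out.println(preferScan(10_000L, 20_000L)); // true
    }
}
```

Without a histogram or similar statistic there is no good value to pass as `estimatedRowsInRange`, which is the difficulty raised in the reply above.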
