hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Keller <brya...@gmail.com>
Subject Re: Using Scans in parallel
Date Sun, 09 Oct 2011 17:59:24 GMT
I was not able to get consistent results using multiple scanners in parallel on a table. I
implemented a counter test that used 8 scanners in parallel on a table with 2m rows with 2k+
columns each, and the results were not consistent. There were no errors thrown, but the count
was off by as much as 2%. Using a single thread gave the same (correct) result every run.

I tried various approaches, such as creating an HTable and opening a connection per thread,
but I was not able to get stable results. I would do some testing before using parallel scanners
as described here.

On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:

> That's part of it, the other part is to get the region demarcations.
> You can also just get the smallest and largest key of the table and pick other demarcations
for your scans. Then your individual scans will likely cover multiple regions and regionservers.
> Your threading model depends on your needs. If you interested in lowest latency you want
to keep your regionservers busy for each query.
> What exactly that means depends on your setup. Maybe you split up the overall scan so
that no more than N scans are active at any regionserver.
> If you're more interested in overall predictability, you might not want parallelize each
scan too much.
> ----- Original Message -----
> From: Sam Seigal <selekt86@yahoo.com>
> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
> Cc: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
> Sent: Wednesday, October 5, 2011 6:18 PM
> Subject: Re: Using Scans in parallel
> So the whole point of getting the region locations is to ensure that
> there is one thread per region server ?
> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>> Hi Sam,
>> There were some attempts to build this in. In the end I think the exact patterns
are different based on what one is trying to achieve.
>> Currently what you can do is getting all the region locations (HTable.getRegionLocations).
From the HRegionInfos you can
>> get the regions start and end keys.
>> Now you can issue parallel scan for as many regions as you want (by create a Scan
object with start and row set to the region's
>> start and end key).
>> You probably want to group the regions by regionserver and have one thread per region
server, or something.
>> -- Lars
>> ________________________________
>> From: Sam Seigal <selekt86@yahoo.com>
>> To: hbase-user@hadoop.apache.org
>> Sent: Wednesday, October 5, 2011 4:29 PM
>> Subject: Using Scans in parallel
>> Hi ,
>> Is there a known way to be able to do Scan's in parallel (in different
>> threads even) and then sort/combine the output ?
>> For a row key like:
>> prefix-event_type-event_id
>> prefix-event_type-event_id
>> I want to declare two scan objects (for say event_id_type foo)
>> Scan 1 =>  0-foo
>> Scan 2 =>  1-foo
>> execute the scans in parallel (maybe even in different threads) and
>> then merge the results ?
>> Thank you,
>> Sam

View raw message