hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject Re: Poor HBase map-reduce scan performance
Date Wed, 05 Jun 2013 08:09:22 GMT

On 6/4/13 6:11 PM, "Sandy Pratt" <prattrs@adobe.com> wrote:

>Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>with an update in the meantime.
>I tried a number of different approaches to eliminate latency and
>"bubbles" in the scan pipeline, and eventually arrived at adding a
>streaming scan API to the region server, along with refactoring the scan
>interface into an event-driven message receiver interface.  In so doing, I
>was able to take scan speed on my cluster from 59,537 records/sec with the
>classic scanner to 222,703 records per second with my new scan API.
>Needless to say, I'm pleased ;)
>More details forthcoming when I get a chance.
>On 5/23/13 3:47 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
>>Thanks for the update, Sandy.
>>If you can open a JIRA and attach your producer / consumer scanner there,
>>that would be great.
>>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <prattrs@adobe.com> wrote:
>>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>>> keep the client fed with a full buffer as much as possible.  When I scan
>>> my table with scanner caching at 100 records, I see about a 24% uplift in
>>> performance (~35k records/sec with the ClientScanner and ~44k records/sec
>>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
>>> more of a wash compared to the standard ClientScanner: ~53k records/sec
>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>> I'm not sure what to make of those results.  I think next I'll shut down
>>> HBase and read the HFiles directly, to see if there's a drop off in
>>> performance between reading them directly vs. via the RegionServer.
>>> I still think that to really solve this there needs to be a sliding window
>>> of records in flight between disk and RS, and between RS and client.  I'm
>>> thinking there's probably a single batch of records in flight between RS
>>> and client at the moment.
>>> Sandy
>>> On 5/23/13 8:45 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
>>> >I am considering scanning a snapshot instead of the table. I believe this
>>> >is what the ExportSnapshot class does. If I could use the scanning code
>>> >from ExportSnapshot then I will be able to scan the HDFS files directly
>>> >and bypass the regionservers. This could potentially give me a huge boost
>>> >in performance for full table scans. However, it doesn't really address
>>> >the poor scan performance against a table.
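
For readers following the thread: the producer/consumer scanner wrapper discussed above was not posted in this message, so the following is only a rough sketch of the general idea, not Sandy's actual code. It assumes the underlying scanner can be treated as a plain java.util.Iterator (HBase's ResultScanner is Iterable, so its iterator() could be passed in). The names PrefetchingIterator, END, and capacity are hypothetical. A background thread keeps pulling records into a bounded queue so the consumer rarely has to wait on a fetch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: wraps any Iterator and prefetches from it on a
// background thread. Null elements are not supported (the queue rejects them).
public class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel marking the end of the stream.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private final Thread producer;
    private volatile Throwable failure; // set if the source throws
    private Object next;                // record taken but not yet returned

    public PrefetchingIterator(Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.producer = new Thread(() -> {
            try {
                // Blocks when the queue is full, which bounds memory use.
                while (source.hasNext()) {
                    queue.put(source.next());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (RuntimeException e) {
                failure = e;
            } finally {
                try {
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // waits only if the producer is behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("source failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = (T) next;
        next = null;
        return result;
    }

    // Stops the producer early; do not call hasNext() after close().
    public void close() {
        producer.interrupt();
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            source.add(i);
        }
        PrefetchingIterator<Integer> it =
                new PrefetchingIterator<>(source.iterator(), 100);
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println("records read: " + count);
    }
}
```

Note that this only overlaps client-side processing with fetching. It does not by itself put more than one batch in flight between the RegionServer and the client, which is the limitation the thread goes on to discuss.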
