hbase-user mailing list archives

From yonghu <yongyong...@gmail.com>
Subject Re: Poor HBase map-reduce scan performance
Date Wed, 05 Jun 2013 14:55:21 GMT
Can anyone explain why client + rpc + server decreases scan performance?
When you use MapReduce to scan an HBase table, the RegionServer and the
TaskTracker run on the same node. So, in my understanding, there should be
no rpc cost.

Thanks!

Yong


On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <prattrs@adobe.com> wrote:

> https://issues.apache.org/jira/browse/HBASE-8691
>
>
> On 6/4/13 6:11 PM, "Sandy Pratt" <prattrs@adobe.com> wrote:
>
> >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> >with an update in the meantime.
> >
> >I tried a number of different approaches to eliminate latency and
> >"bubbles" in the scan pipeline, and eventually arrived at adding a
> >streaming scan API to the region server, along with refactoring the scan
> >interface into an event-driven message receiver interface.  In so doing, I
> >was able to take scan speed on my cluster from 59,537 records/sec with the
> >classic scanner to 222,703 records per second with my new scan API.
> >Needless to say, I'm pleased ;)
> >
> >More details forthcoming when I get a chance.
> >
> >Thanks,
> >Sandy
> >
> >On 5/23/13 3:47 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
> >
> >>Thanks for the update, Sandy.
> >>
> >>If you can open a JIRA and attach your producer / consumer scanner there,
> >>that would be great.
> >>
> >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <prattrs@adobe.com> wrote:
> >>
> >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
> >>> keep the client fed with a full buffer as much as possible.  When scanning
> >>> my table with scanner caching at 100 records, I see about a 24% uplift in
> >>> performance (~35k records/sec with the ClientScanner and ~44k records/sec
> >>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
> >>> more of a wash compared to the standard ClientScanner: ~53k records/sec
> >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> >>>
> >>> I'm not sure what to make of those results.  I think next I'll shut down
> >>> HBase and read the HFiles directly, to see if there's a drop off in
> >>> performance between reading them directly vs. via the RegionServer.
> >>>
> >>> I still think that to really solve this there needs to be a sliding window
> >>> of records in flight between disk and RS, and between RS and client.  I'm
> >>> thinking there's probably a single batch of records in flight between RS
> >>> and client at the moment.
> >>>
> >>> Sandy
> >>>
> >>> On 5/23/13 8:45 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
> >>>
> >>> >I am considering scanning a snapshot instead of the table. I believe this
> >>> >is what the ExportSnapshot class does. If I could use the scanning code
> >>> >from ExportSnapshot then I will be able to scan the HDFS files directly
> >>> >and bypass the regionservers. This could potentially give me a huge boost
> >>> >in performance for full table scans. However, it doesn't really address
> >>> >the poor scan performance against a table.
> >>>
> >>>
> >
>
>
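
For anyone trying to picture the producer/consumer scanner wrapper
discussed above, here is a minimal sketch of the general idea in plain
Java. It is not Sandy's code and not an HBase API. The class name
PrefetchingIterator, the capacity parameter, and the END sentinel are all
made up for illustration, and it wraps a generic Iterator rather than a
real ResultScanner. A background thread keeps pulling from the source into
a bounded queue, so the consumer usually finds records already buffered
instead of waiting on the next fetch.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical prefetching wrapper (illustration only, not part of HBase).
 * A producer thread drains the source iterator into a bounded queue while
 * the caller consumes from the other end.
 */
class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel marking the end of the stream (assumption of this sketch).
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile RuntimeException failure; // set if the source throws
    private Object pending;                    // next element, if already taken

    PrefetchingIterator(Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                // put() blocks when the queue is full, which bounds how many
                // records are in flight. Null elements are not supported.
                while (source.hasNext()) {
                    queue.put(source.next());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (RuntimeException e) {
                failure = e;
            } finally {
                // Always signal the end; offer() never blocks, and on the
                // interrupt path there may be no room, which is acceptable
                // for a sketch that nobody interrupts.
                try {
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            try {
                pending = queue.take(); // waits only if the buffer is empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        if (pending == END) {
            if (failure != null) {
                throw new IllegalStateException("producer failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        @SuppressWarnings("unchecked")
        T value = (T) pending;
        pending = null;
        return value;
    }
}
```

A wrapper like this only hides fetch latency behind consumer work. It does
not make the underlying fetches any faster, which would be consistent with
the smaller gain reported above once scanner caching is already large.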
