hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Performance characteristics of scans using timestamp as the filter
Date Thu, 06 Oct 2011 23:02:08 GMT
(super late answer, I'm cleaning up my old unread emails)

This sort of sounds like what Mozilla did for the crash reports.

The issue with your solution is that when you're looking to get only a
small portion of your whole dataset, you still have to go over the rest
of the data to reach it. So if you just need the daily data you're
taking a pretty big hit.

Keeping a log of modified keys sounds OK, but I'm not sure how you
plan to feed the data to MR (unless you just need the key and nothing
else).
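To make the trade-off above concrete, here's a minimal sketch (not HBase code; a toy in-memory table, with hypothetical helper names `full_scan_for_day` and `changelog_for_day`) contrasting the two approaches: a timestamp-filtered scan has to visit every row in the table, while a log of modified keys only touches the rows that actually changed that day.

```python
# Toy model: table maps row_key -> (timestamp, value).
# full_scan_for_day models a timestamp-filtered scan: it must visit
# every row to find the day's changes, so cost grows with table size.
# changelog_for_day models keeping a per-day log of modified keys:
# cost grows only with the number of changes.

def full_scan_for_day(table, day_start, day_end):
    """Visit every row; keep only those written in [day_start, day_end)."""
    visited = 0
    hits = []
    for key, (ts, value) in table.items():
        visited += 1
        if day_start <= ts < day_end:
            hits.append((key, value))
    return hits, visited  # visited == len(table), regardless of hit count

def changelog_for_day(table, changelog, day):
    """Look up only the keys logged as modified on the given day."""
    keys = changelog.get(day, [])
    return [(k, table[k][1]) for k in keys], len(keys)

# Toy data: 1,000 rows total; 100 of them were written on "day" 5.
table = {f"row{i:04d}": (i % 10, f"v{i}") for i in range(1000)}
changelog = {5: [k for k, (ts, _) in table.items() if ts == 5]}

hits_scan, rows_read_scan = full_scan_for_day(table, 5, 6)
hits_log, rows_read_log = changelog_for_day(table, changelog, 5)

assert sorted(hits_scan) == sorted(hits_log)  # same result set
print(rows_read_scan, rows_read_log)  # → 1000 100
```

Both approaches return the same rows, but the scan reads the whole table to find them; as the table grows from GBs to TBs, that fixed per-row cost is what makes the daily job slower even though the daily delta stays small.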


On Fri, Sep 9, 2011 at 11:32 AM, Leif Wickland <leifwickland@gmail.com> wrote:
> (Apologies if this has been answered before.  I couldn't find anything in
> the archives quite along these lines.)
> I have a process which writes to HBase as new data arrives.  I'd like to run
> a map-reduce periodically, say daily, that takes the new items as input.  A
> naive approach would use a scan which grabs all of the rows that have a
> timestamp in a specified interval as the input to a MapReduce.  I tested a
> scenario like that with 10s of GB of data and it seemed to perform OK.
>  Should I expect that approach to continue to perform reasonably well when
> I have TBs of data?
> From what I understand of the HBase architecture, I don't see a reason that
> the scan approach would continue to perform well as the data grows.  It
> seems like I may have to keep a log of modified keys and use that as the
> map-reduce input, instead.
> Thanks,
> Leif Wickland
