hbase-user mailing list archives

From Dave Latham <lat...@davelink.net>
Subject Re: scan column families with different time ranges
Date Sun, 02 Aug 2015 03:11:03 GMT
Thanks Andrew and Vladimir.  As Vladimir notes, it looks like the TimeRange
is checked at scanner creation:
StoreScanner constructor -> getScannersNoCompaction -> selectScannersFrom
-> StoreFileScanner.shouldUseScanner -> StoreFile.passesTimerangeFilter

The StoreScanner would probably need to store the timerange for that family
separately from the scan, in the same way it keeps the set of columns.  So
it may not be too intrusive.

Vladimir, note that B is 100x larger than A, rather than the other way
round.  Cutting out the old store files could well also reduce disk IO for
that family by 100x.
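
To make the idea concrete, here is a minimal, self-contained sketch of the
per-family time range lookup and the store file pruning it would enable.  It is
a toy model, not HBase code: TimeRange, FileInfo, ScanSpec, setFamilyTimeRange
and selectFiles are all hypothetical names invented for illustration, and the
half-open [min, max) interval is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are HBase classes.
public class PerFamilyTimeRangeSketch {

    /** Half-open timestamp interval [min, max), in milliseconds (assumed). */
    static final class TimeRange {
        final long min;
        final long max;

        TimeRange(long min, long max) {
            if (min < 0 || max < min) {
                throw new IllegalArgumentException("invalid range: " + min + ", " + max);
            }
            this.min = min;
            this.max = max;
        }

        static TimeRange allTime() {
            return new TimeRange(0L, Long.MAX_VALUE);
        }

        /** True if a file whose cells span [fileMin, fileMax] (inclusive) may hold matching cells. */
        boolean mayContain(long fileMin, long fileMax) {
            return fileMin < max && fileMax >= min;
        }
    }

    /** Hypothetical stand-in for the timestamp metadata kept per store file. */
    static final class FileInfo {
        final String name;
        final long minTs;
        final long maxTs;

        FileInfo(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Hypothetical scan description: one default range plus optional per-family overrides. */
    static final class ScanSpec {
        private TimeRange defaultRange = TimeRange.allTime();
        private final Map<String, TimeRange> familyRanges = new HashMap<>();

        ScanSpec setTimeRange(long min, long max) {
            defaultRange = new TimeRange(min, max);
            return this;
        }

        ScanSpec setFamilyTimeRange(String family, long min, long max) {
            familyRanges.put(family, new TimeRange(min, max));
            return this;
        }

        /** A family-specific range wins; otherwise fall back to the scan-wide range. */
        TimeRange rangeFor(String family) {
            return familyRanges.getOrDefault(family, defaultRange);
        }
    }

    /** Keeps only the files of one family that can contain cells in that family's range. */
    static List<FileInfo> selectFiles(ScanSpec scan, String family, List<FileInfo> files) {
        TimeRange range = scan.rangeFor(family);
        List<FileInfo> selected = new ArrayList<>();
        for (FileInfo f : files) {
            if (range.mayContain(f.minTs, f.maxTs)) {
                selected.add(f);
            }
        }
        return selected;
    }
}
```

With a range set only on B, selectFiles for A still returns every file, while
for B any file whose newest cell is older than the range start is dropped
before it is ever opened.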

On Sat, Aug 1, 2015 at 7:17 PM, Vladimir Rodionov <vladrodionov@gmail.com>
wrote:

> I think TimeRange is handled higher up, when the region scanner is created.
> With data size in B 100x smaller than in A, I do not understand where the
> source of the IO bottleneck is.
> On Aug 1, 2015 9:16 AM, "Andrew Purtell" <apurtell@apache.org> wrote:
>
> > Hi Dave,
> >
> > > Would HBase be willing to accept updating Scan to have different
> > > TimeRanges for each column family?
> >
> > We could try it. I'm not sure how familiar you are with the relevant code.
> > I'm guessing some? Look at ScanQueryMatcher. This and related concerns
> > govern how we search through store files. Timerange handling is done at
> > the top level (the SQM). Then for each column we have a leaf tracker
> > (implementing ColumnTracker) that tracks column specific info like number
> > of versions for a cell found in each. We'd need to push timerange handling
> > down into the column trackers. This would be a tricky refactor on delicate
> > code. I suspect we could be comfortable making this change in master and
> > on branch-1 for upcoming unscheduled minor release line 1.3. Would that
> > work? Or would this change need to go further back?
> >
> > Maybe someone else has another suggestion.
> >
> >
> > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <latham@davelink.net> wrote:
> >
> > > I have a table with 2 column families, call them A and B, with new data
> > > regularly being added. They are very different sizes: B is 100x the size
> > > of A.  Among other uses for this data, I have a MapReduce job that needs
> > > to read all of A, but only recent data from B (e.g. last day).  Here are
> > > some methods I've considered:
> > >
> > >    1. Use a Filter to throw out older data from B (this is what I
> > >    currently do).  However, all the data from B still needs to be read
> > >    from disk, causing a disk IO bottleneck.
> > >    2. Configure the table input format to read from B only, using a
> > >    TimeRange for recent data, and have each map task open a separate
> > >    scanner for A (without a TimeRange) then merge the data in the map
> > >    task.  However, this adds complexity to the job and gives up the
> > >    atomicity/consistency guarantees as new writes hit both column
> > >    families.
> > >    3. Add a new column family C to the table with an additional copy of
> > >    the data in B, but set a TTL on it.  All writes duplicate the data
> > >    written to B and C.  Change the scan to include C instead of B.
> > >    However, this adds all the overhead of another column family, more
> > >    writes, and having to set the TTL to the maximum of any time window I
> > >    want to scan efficiently.
> > >    4. Implement an enhancement to HBase's Scan to allow giving each
> > >    column family its own TimeRange.  The job would then be able to skip
> > >    most old large store files (hopefully all of them with tiered
> > >    compaction at some point).
> > >
> > > Does anyone have other suggestions?  Would HBase be willing to accept
> > > updating Scan to have different TimeRanges for each column family?
> > >
> > >
> > > Dave
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>
