hbase-user mailing list archives

From Dave Latham <lat...@davelink.net>
Subject Re: scan column families with different time ranges
Date Mon, 03 Aug 2015 18:14:43 GMT
Jean-Marc,

"Recent" is often last 24 hours or so, though if this is worked out I may
use it for other ranges as well.  Yes, currently there are weekly major
compactions, so recently compacted regions would not be able to exclude the
old store files. That's why I'm also hoping to revive some notion of tiered
compaction to keep older data in separate store files from recent data.
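
To make the compaction point concrete, here is a minimal toy model in plain
Java. It is not HBase code: the class and method names are hypothetical, and
the only assumption taken from HBase is that a time-range scan can skip a
store file whose cell timestamps fall entirely outside the half-open range
[min, max). It shows how separate old files can be skipped, and why a single
major-compacted file cannot.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: these types are hypothetical and are not HBase classes. */
public class StoreFileSkipSketch {

    /** A store file summarized by the oldest and newest cell timestamps it holds. */
    static final class FileModel {
        final String name;
        final long minTs; // oldest cell timestamp, inclusive
        final long maxTs; // newest cell timestamp, inclusive

        FileModel(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** True if the file may hold cells in [scanMin, scanMax) and so must be read. */
    static boolean mustRead(FileModel f, long scanMin, long scanMax) {
        return f.maxTs >= scanMin && f.minTs < scanMax;
    }

    /** Names of the files a scan over [scanMin, scanMax) cannot skip. */
    static List<String> filesToRead(List<FileModel> files, long scanMin, long scanMax) {
        List<String> names = new ArrayList<>();
        for (FileModel f : files) {
            if (mustRead(f, scanMin, scanMax)) {
                names.add(f.name);
            }
        }
        return names;
    }

    /** Models a major compaction: one output file spanning every input timestamp. */
    static FileModel majorCompact(List<FileModel> files) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (FileModel f : files) {
            min = Math.min(min, f.minTs);
            max = Math.max(max, f.maxTs);
        }
        return new FileModel("compacted", min, max);
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 30 * day;

        List<FileModel> files = new ArrayList<>();
        files.add(new FileModel("old-1", 0, 10 * day));
        files.add(new FileModel("old-2", 10 * day + 1, 29 * day - 1));
        files.add(new FileModel("recent", 29 * day, now - 1));

        // Scan the last day while old and recent data live in separate files.
        System.out.println(filesToRead(files, now - day, now)); // prints [recent]

        // After a major compaction the single file overlaps every time range.
        List<FileModel> compacted = new ArrayList<>();
        compacted.add(majorCompact(files));
        System.out.println(filesToRead(compacted, now - day, now)); // prints [compacted]
    }
}
```

In this model the benefit of a per-family TimeRange depends entirely on old
and recent cells ending up in different files, which is the job a tiered
compaction policy would do.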

Dave

On Sun, Aug 2, 2015 at 6:22 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Just thinking out loud:
> "Cutting out the old store files could well also reduce disk IO for
> that family by 100x."
>
> What is "recent" for your data? More than 7 days? Or less? Don't you have
> weekly major compactions? If so, and if you are scanning for more than 7
> days, then you will read the older files anyway, no?
>
> JM
> On 2015-08-02 05:57, "Ted Yu" <yuzhihong@gmail.com> wrote:
>
> > Dave:
> > I wonder if Filter response can be enhanced in the following manner:
> >
> > http://pastebin.com/sb6apTPm
> >
> > My approach is based on using the essential column family (column family A
> > in your case) to guide whether the remaining column families should be
> > loaded.
> > To be specific, if outside the TimeRange you specify (last day), your
> > filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
> >
> > What do you think?
> >
> > Cheers
> >
> > On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham <latham@davelink.net> wrote:
> >
> > > Thanks for brainstorming, Ted.  That sounds like option 2 I listed, using
> > > a separate scanner for A vs B, which "adds complexity to the job and gives
> > > up the atomicity/consistency guarantees as new writes hit both column
> > > families".
> > >
> > > On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > Can you achieve your goal with two scans?
> > > > The first scan specifies a TimeRange corresponding to the last day. This
> > > > scan returns both column families.
> > > > The other scan specifies a TimeRange excluding the last day. This scan
> > > > returns column family A.
> > > >
> > > > Cheers
> > > >
> > > > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <latham@davelink.net> wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > Thanks for the suggestion, but I'm not sure that it helps my case
> > > > > much.  I wasn't very familiar with the feature, and it doesn't seem
> > > > > very well documented - I had to go to the source and the originating
> > > > > JIRA to understand how it works.  It sounds like it allows you to mark
> > > > > which column families the filter operates on ("essential" seems an odd
> > > > > name).  If any data from those column families passes the filter, then
> > > > > the scan loads and includes data from the remaining families without
> > > > > filtering it.  In my case, it's not clear from a row's family A whether
> > > > > or not family B for that row is required (though that could probably be
> > > > > added).  Moreover, even if a row has recent data, we don't want to load
> > > > > all the old data from that row.  We'd prefer to be able to entirely
> > > > > skip reading the data off disk for the old store files.
> > > > >
> > > > > Dave
> > > > >
> > > > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > >
> > > > > > Have you considered using the essential column family feature
> > > > > > (through Filter)?
> > > > > > In your case A would be the essential column family.
> > > > > > Within the TimeRange for recent data, the filter would return both
> > > > > > column families.
> > > > > > Outside the TimeRange, only family A is returned.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <latham@davelink.net> wrote:
> > > > > >
> > > > > > > I have a table with 2 column families, call them A and B, with new
> > > > > > > data regularly being added. They are very different sizes: B is
> > > > > > > 100x the size of A.  Among other uses for this data, I have a
> > > > > > > MapReduce job that needs to read all of A, but only recent data
> > > > > > > from B (e.g. last day).  Here are some methods I've considered:
> > > > > > >
> > > > > > >    1. Use a Filter to throw out older data from B (this is what I
> > > > > > >    currently do).  However, all the data from B still needs to be
> > > > > > >    read from disk, causing a disk IO bottleneck.
> > > > > > >    2. Configure the table input format to read from B only, using a
> > > > > > >    TimeRange for recent data, and have each map task open a separate
> > > > > > >    scanner for A (without a TimeRange), then merge the data in the
> > > > > > >    map task.  However, this adds complexity to the job and gives up
> > > > > > >    the atomicity/consistency guarantees as new writes hit both
> > > > > > >    column families.
> > > > > > >    3. Add a new column family C to the table with an additional copy
> > > > > > >    of the data in B, but set a TTL on it.  All writes would be
> > > > > > >    duplicated to both B and C.  Change the scan to include C instead
> > > > > > >    of B.  However, this adds all the overhead of another column
> > > > > > >    family, more writes, and having to set the TTL to the maximum of
> > > > > > >    any time window I want to scan efficiently.
> > > > > > >    4. Implement an enhancement to HBase's Scan to allow giving each
> > > > > > >    column family its own TimeRange.  The job would then be able to
> > > > > > >    skip most old large store files (hopefully all of them with
> > > > > > >    tiered compaction at some point).
> > > > > > >
> > > > > > > Does anyone have other suggestions?  Would HBase be willing to
> > > > > > > accept updating Scan to have a different TimeRange for each column
> > > > > > > family?
> > > > > > >
> > > > > > >
> > > > > > > Dave
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
