hbase-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: One-table w/ multi-CF or multi-table w/ one-CF?
Date Sat, 06 Sep 2014 18:34:39 GMT
Each range might span multiple regions, depending on how much data I want to
scan in the MR jobs.

The ranges are dynamic, specified by the user, but the number of bins can
be static (when the table/schema is created).
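
To make the binning idea concrete, here is a minimal sketch; it is not HBase
code. It assumes a hypothetical rowkey layout of a one-byte bin prefix followed
by the original monotonic key, and shows how one user-specified range expands
into one range per bin. The function names, the CRC32-based bin choice, and the
key format are assumptions for illustration only; each resulting (start, stop)
pair would still have to be handed to its own scan.

```python
import zlib
from typing import List, Tuple


def bin_of(key: bytes, num_bins: int) -> int:
    """Pick a bin for a key. CRC32 is an arbitrary choice assumed here."""
    return zlib.crc32(key) % num_bins


def salted_key(key: bytes, num_bins: int) -> bytes:
    """Prefix the original key with its one-byte bin id (hypothetical layout)."""
    return bytes([bin_of(key, num_bins)]) + key


def binned_ranges(start: bytes, stop: bytes, num_bins: int) -> List[Tuple[bytes, bytes]]:
    """Expand one [start, stop) range over the original keys into one
    [start, stop) range per bin over the prefixed keys.

    An empty stop means "to the end", mirroring the usual open-ended scan.
    """
    if not 1 <= num_bins <= 256:
        raise ValueError("num_bins must fit in a one-byte prefix (1..256)")
    ranges = []
    for b in range(num_bins):
        prefix = bytes([b])
        if stop:
            bin_stop = prefix + stop
        elif b < 255:
            # Open-ended: stop at the first possible key of the next bin.
            bin_stop = bytes([b + 1])
        else:
            # Last possible bin: open-ended to the end of the table.
            bin_stop = b""
        ranges.append((prefix + start, bin_stop))
    return ranges
```

The number of bins is fixed up front, so only the (start, stop) pair changes
per job; the cost is that every job fans out into num_bins ranges.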

Jianshi


On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. 16 to 256 ranges
>
> Would each range fall within a single region, or may a range span regions?
> Are the ranges dynamic?
>
> Specifying multiple ranges on the command line would be out of the question;
> a file with the ranges is needed.
>
> Cheers
>
>
> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
> > Thanks Ted for the reference.
> >
> > That's right: extend row.start and row.stop to specify multiple ranges,
> > and extend getSplits as well.
> >
> > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
> > 256 ranges.
> >
> > Jianshi
> >
> >
> >
> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > Please refer to HBASE-5416 (Filter on one CF and if a match, then load
> > > and return full row)
> > >
> > > bq. to extend TableInputFormat to accept multiple row ranges
> > >
> > > You mean extending hbase.mapreduce.scan.row.start and
> > > hbase.mapreduce.scan.row.stop so that multiple ranges can be specified?
> > > How many such ranges do you normally need?
> > >
> > > Cheers
> > >
> > >
> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.huang@gmail.com>
> > > wrote:
> > >
> > > > Thanks Ted,
> > > >
> > > > I'll pre-split the table during ingestion. The reason to keep the
> > > > rowkey monotonic is to make it easier to work with TableInputFormat;
> > > > otherwise I would've binned it into 256 splits. (Well, I think a good
> > > > way is to extend TableInputFormat to accept multiple row ranges. If
> > > > there's an existing efficient implementation, please let me know :)
> > > >
> > > > Would you elaborate a little more on the heap memory usage during a
> > > > scan? Is there any reference for that?
> > > >
> > > > Jianshi
> > > >
> > > >
> > > >
> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > >
> > > > > If you use monotonically increasing rowkeys, separating out the
> > > > > column family into a new table would give you the same issue you're
> > > > > facing today.
> > > > >
> > > > > With a single table, the essential column family feature would
> > > > > reduce the amount of heap memory used during a scan. With two
> > > > > tables, there is no such facility.
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.huang@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the
> > > > > > performance I care about most is scan performance.
> > > > > >
> > > > > > It's mostly for analytics, so I don't care much about atomicity
> > > > > > currently.
> > > > > >
> > > > > > What's your suggestion?
> > > > > >
> > > > > > Jianshi
> > > > > >
> > > > > >
> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > >
> > > > > > > Is this the same table you mentioned in the thread about
> > > > > > > RegionTooBusyException?
> > > > > > >
> > > > > > > If you move the column family to another table, you may have to
> > > > > > > handle atomicity yourself - currently atomic operations are
> > > > > > > within region boundaries.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.huang@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm currently putting everything into one table (to make
> > > > > > > > cross-reference queries easier), and there's one CF whose
> > > > > > > > rowkeys are very different from the rest. Currently it works
> > > > > > > > well, but I'm wondering if it will cause performance issues
> > > > > > > > in the future.
> > > > > > > >
> > > > > > > > So my questions are:
> > > > > > > >
> > > > > > > > 1) Will there be performance penalties with the way I'm doing it?
> > > > > > > > 2) Should I move that CF to a separate table?
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > --
> > > > > > > > Jianshi Huang
> > > > > > > >
> > > > > > > > LinkedIn: jianshi
> > > > > > > > Twitter: @jshuang
> > > > > > > > Github & Blog: http://huangjs.github.com/
> > > > > > > >



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
