hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: getSplits question
Date Thu, 10 Feb 2011 16:14:34 GMT

Ryan,

Just to point out the obvious...

On smaller tables where you don't get enough parallelism, you can manually force the table's
regions to be split.
My understanding is that if/when the table grows, it will then go back to splitting normally.


This way, if you have a 'small' lookup table that is relatively static, you can manually split
it to the 'right' size for your cloud.
If you are seeding a system, you can do the splits to get good parallelism and not overload
a single region with inserts, then let it go back to its normal growth pattern and splits.
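
To make the seeding case concrete, here is a minimal sketch of how one might pick the split points up front. It assumes (my assumption, not something stated in this thread) that row keys are spread roughly evenly over their first byte; the class and method names are hypothetical, and the code is plain Java that only computes the boundary keys. Handing them to HBase is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes boundary keys for pre-splitting a table.
final class PreSplitKeys {

    /**
     * Returns (numRegions - 1) single-byte split points that divide the
     * unsigned byte range 0x00..0xFF into numRegions roughly equal parts.
     * Assumes row keys are evenly distributed over their first byte.
     */
    static List<byte[]> splitKeys(int numRegions) {
        if (numRegions < 1 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in 1..256");
        }
        List<byte[]> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int boundary = (i * 256) / numRegions; // integer division, 1..255
            keys.add(new byte[] { (byte) boundary });
        }
        return keys;
    }

    public static void main(String[] args) {
        // Print the boundaries for a 4-region layout as hex bytes.
        for (byte[] key : splitKeys(4)) {
            System.out.printf("%02x%n", key[0] & 0xff);
        }
    }
}
```

If the keys are not evenly distributed (sequential IDs, timestamps), boundaries chosen this way will not balance the regions, and the split points would have to come from sampling the real keys instead.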

This would solve the OP's issue and, as you point out, he wouldn't have to worry about getSplits().

Does this make sense, or am I missing something?

-Mike

> Date: Wed, 9 Feb 2011 23:54:19 -0800
> Subject: Re: getSplits question
> From: ryanobjc@gmail.com
> To: user@hbase.apache.org
> CC: hbase-user@hadoop.apache.org
> 
> By default each map gets the contents of 1 region. A region is by
> default a maximum of 256MB. There is no trivial way to generally
> bisect a region in half, in terms of row count, by just knowing what
> we know (start, end key).
> 
> For very large tables that have > 100 regions, this algorithm works
> really well and you get some good parallelism.  If you want to see a
> lot of parallelism out of 1 region, you might have to work a lot
> harder.  Or reduce your region size and have more regions.  Be warned
> though, that more regions has performance hits in other areas
> (specifically server startup/shutdown/assignment times).  So you
> probably don't want 50,000 32MB regions.
> 
> -ryan
> 
> On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <ghendrey@decarta.com> wrote:
> > Oh, I definitely don't *need* my own to run mapreduce. However, if I want
> > to control the number of records handled by each mapper (splitsize) and the
> > startrow and endrow, then I thought I had to write my own getSplits(). Is
> > there another way to accomplish this, because I do need the combination of
> > controlled splitsize and start/endrow.
> >
> > -geoff
> >
> > -----Original Message-----
> > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> > Sent: Wednesday, February 09, 2011 11:43 PM
> > To: user@hbase.apache.org
> > Cc: hbase-user@hadoop.apache.org
> > Subject: Re: getSplits question
> >
> > You shouldn't need to write your own getSplits() method to run a map
> > reduce, I never did at least...
> >
> > -ryan
> >
> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <ghendrey@decarta.com> wrote:
> >> Are endrows inclusive or exclusive? The docs say exclusive, but then the
> >> question arises as to how to form the last split for getSplits(). The
> >> code below runs fine, but I believe it is omitting some rows, perhaps
> >> b/c of the exclusive end row. For the final split, should the endrow be
> >> null? I tried that, and got what appeared to be a final split without an
> >> endrow at all. Would appreciate a pointer to the correct implementation
> >> of getSplits in which I desire to provide a startrow, endrow, and
> >> splitsize. Apparently this isn't it :-) :
> >>
> >>
> >>
> >> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
> >> byte[] splitStop = null;
> >> String hostname = null;
> >> while ((results = resultScanner.next(splitSize)).length > 0) {
> >>     // System.out.println("results :-------------------------- " + results);
> >>     byte[] splitStart = results[0].getRow();
> >>     // I think this is a problem...we don't actually include this row in the
> >>     // split since it's exclusive..revisit this and correct
> >>     splitStop = results[results.length - 1].getRow();
> >>     HRegionLocation location = table.getRegionLocation(splitStart);
> >>     hostname = location.getServerAddress().getHostname();
> >>     InputSplit split = new TableSplit(table.getTableName(), splitStart, splitStop, hostname);
> >>     splits.add(split);
> >>     System.out.println("initializing splits: " + split.toString());
> >> }
> >> resultScanner.close();
> >>
> >>
> >>
> >>
> >>
> >> -g
> >>
> >>
> >
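
On the exclusive end row question in the quoted code: if the end row is exclusive, using the last row of a batch as that split's end drops that row. One way around it is to use the first row of the next batch as the end of the current split, and to leave the final split open-ended. Below is a minimal sketch of that chaining in plain Java. It is not the HBase API: row keys are simplified to Strings, the rows are assumed to be already sorted and held in memory, and the class and method names are hypothetical. An empty end stands for "no upper bound", which matches HBase's convention of an empty stop row meaning scan to the end of the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of chaining half-open [start, end) ranges.
final class ExclusiveEndSplits {

    /** A half-open key range [start, end); an empty end means "no upper bound". */
    static final class Range {
        final String start;
        final String end;

        Range(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Cuts sortedRows into ranges of at most splitSize rows each. Every range
     * ends at the first row of the next range, so no row falls in a gap; the
     * last range ends at finalEnd (pass "" to leave it open-ended).
     */
    static List<Range> chunk(List<String> sortedRows, int splitSize, String finalEnd) {
        if (splitSize < 1) {
            throw new IllegalArgumentException("splitSize must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        for (int i = 0; i < sortedRows.size(); i += splitSize) {
            int next = i + splitSize;
            String end = next < sortedRows.size() ? sortedRows.get(next) : finalEnd;
            ranges.add(new Range(sortedRows.get(i), end));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(chunk(rows, 2, ""));
    }
}
```

With a scanner you only learn the next batch's first row after reading it, so a real getSplits() would have to hold each split back by one iteration before adding it to the list.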
 		 	   		  