hbase-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: getSplits() in TableInputFormatBase
Date Sun, 11 Apr 2010 09:10:17 GMT
3 tables? Are you counting -ROOT- and .META. as well?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Apr 11, 2010 at 1:57 AM, john smith <js1987.smith@gmail.com> wrote:

> From the web interface...
>
>
> number of regions = 5
> number of tables = 3
>
> Thanks
>
>
> On Sun, Apr 11, 2010 at 2:23 PM, Amandeep Khurana <amansk@gmail.com>
> wrote:
>
> > How many regions do you have?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Apr 11, 2010 at 1:39 AM, john smith <js1987.smith@gmail.com>
> > wrote:
> >
> > > Amandeep,
> > >
> > > Thanks for the explanation. What is the default value for the number of
> > > maps? Is it not equal to the number of regions?
> > >
> > > Right now I am running HBase in pseudo-distributed mode. If I set the
> > > number of map tasks to 100000 (some big number), I get numSplits = 1.
> > >
> > > If I don't set anything, numSplits = 2.
> > >
> > > Can you explain this?
> > >
> > > Thanks
> > > j.S
> > >
> > > On Sun, Apr 11, 2010 at 1:50 PM, Amandeep Khurana <amansk@gmail.com>
> > > wrote:
> > >
> > > > If you set the number of map tasks higher than the number of regions
> > > > (I generally set it to 100000 or something like that), the number of
> > > > splits = number of regions. If you keep it lower, it combines several
> > > > regions into a single split.
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Apr 11, 2010 at 1:15 AM, john smith <js1987.smith@gmail.com>
> > > > wrote:
> > > >
> > > > > Amandeep,
> > > > >
> > > > > I don't think that is true. See the explanation in the docs:
> > > > >
> > > > >
> > > > > "Splits are created in number equal to the smallest between numSplits
> > > > > and the number of HRegions in the table. If the number of splits is
> > > > > smaller than the number of HRegions then splits are spanned across
> > > > > multiple HRegions and are grouped the most evenly possible. In the
> > > > > case splits are uneven the bigger splits are placed first in the
> > > > > InputSplit array."
> > > > >
> > > > > (HRegion: http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html)
> > > > >
> > > > >
> > > > > Depending on whether numSplits is smaller or larger than the number
> > > > > of regions, it chooses the real number of splits, and the same is
> > > > > done in the code:
> > > > >
> > > > > // Code
> > > > > int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;
> > > > >
> > > > > Here startKeys.length is the number of regions.
> > > > >
> > > > > Am I right?
> > > > >
> > > > > Thanks
> > > > > j.S
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > The number of splits is equal to the number of regions...
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Apr 11, 2010 at 12:54 AM, john smith <js1987.smith@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > In the method
> > > > > > >
> > > > > > > public org.apache.hadoop.mapred.InputSplit[] getSplits(
> > > > > > >     org.apache.hadoop.mapred.JobConf job, int numSplits)
> > > > > > >
> > > > > > > how is "numSplits" decided? I've seen different values of
> > > > > > > numSplits for different MR jobs. Any reason for this?
> > > > > > >
> > > > > > > Also, what if I ignore numSplits and always split at region
> > > > > > > boundaries? I guess that splitting at region boundaries makes
> > > > > > > more sense and improves data locality somewhat.
> > > > > > >
> > > > > > > Any comments on the above statement?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > j.S
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
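
For readers following the thread, the rule quoted from the docs can be sketched roughly as below. This is a minimal illustration, not the HBase source. The class name SplitPlanSketch and the method regionsPerSplit are hypothetical, and the even grouping with the larger groups first is my reading of the quoted javadoc ("grouped the most evenly possible ... bigger splits are placed first"). Only the clamp on realNumSplits is taken from the line quoted in the thread, with regionCount standing in for startKeys.length.

```java
import java.util.Arrays;

/** Illustrative only: not the HBase source. All names here are hypothetical. */
final class SplitPlanSketch {

    /**
     * Returns how many regions each split would cover, assuming the rule
     * quoted in the thread: the split count is the smaller of numSplits and
     * the region count, and regions are grouped as evenly as possible with
     * the larger groups placed first.
     */
    static int[] regionsPerSplit(int regionCount, int numSplits) {
        if (regionCount <= 0 || numSplits <= 0) {
            return new int[0];
        }
        // Same clamp as the line quoted in the thread
        // (regionCount plays the role of startKeys.length).
        int realNumSplits = numSplits > regionCount ? regionCount : numSplits;

        int base = regionCount / realNumSplits;   // regions every split gets
        int extra = regionCount % realNumSplits;  // leftover regions to hand out

        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            // The first 'extra' splits take one more region, so bigger splits come first.
            sizes[i] = i < extra ? base + 1 : base;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 5 regions, as reported from the web interface in the thread.
        System.out.println(Arrays.toString(regionsPerSplit(5, 2)));      // [3, 2]
        System.out.println(Arrays.toString(regionsPerSplit(5, 100000))); // [1, 1, 1, 1, 1]
    }
}
```

Under these assumptions, a numSplits hint at or above the region count yields one split per region, which matches the advice in the thread about setting the number of map tasks to a large value.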
