hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Question on region server/data node restart
Date Wed, 25 Feb 2009 13:29:50 GMT
Correction, I was suggesting 0.18.2 (the svn branch) since it has many fixes
that Michael would need and it won't break anything for him (as 0.19.0 will
do with MR jobs).

J-D

On Wed, Feb 25, 2009 at 1:33 AM, stack <stack@duboce.net> wrote:

> Michael, as J-D suggests above, can you update to 0.19.0 hbase?  It's better
> about all of this stuff -- though not as reactive as 0.20.0 will be.
> St.Ack
>
> On Tue, Feb 24, 2009 at 8:33 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
>
> > No problem :)
> >
> > On Tue, Feb 24, 2009 at 6:30 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > > Ok so that region server must have been holding .META., you will have to
> > > restart HBase.
> > >
> > > Sorry
> > >
> > > J-D
> > >
> > > On Tue, Feb 24, 2009 at 11:27 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
> > >
> > >> Sorry, I mean that some requests fail when a region server is down in
> > >> Hbase 0.18.1,
> > >> which we are using now.
> > >>
> > >> Besides, when I started the stopped region server and stopped another one,
> > >> not only were "old" requests stuck because of retries, but new requests
> > >> (e.g. issued by hbase shell) fail too.
> > >>
> > >> The master.jsp also fails with
> > >>
> > >> Trying to contact region server <...>:60020 for region .META.,,1, row
> > >> '', but failed after 10 attempts.
> > >> Exceptions: java.io.IOException: Call failed on local exception
> > >>
> > >> Thank you for your cooperation,
> > >> M.
> > >>
> > >> On Tue, Feb 24, 2009 at 6:06 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >> > As I wrote, you should upgrade to 0.18 branch in SVN.
> > >> >
> > >> > J-D
> > >> >
> > >> > On Tue, Feb 24, 2009 at 11:04 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
> > >> >
> > >> >> I do not know if it was holding the ROOT or META region.
> > >> >> It looks like requests may fail in Hbase 0.18 if a region server stops.
> > >> >>
> > >> >> Thanks,
> > >> >> M.
> > >> >>
> > >> >> On Tue, Feb 24, 2009 at 5:40 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >> >> > Well this should not happen like that. Was the region server holding the
> > >> >> > ROOT or META region? If so, well that's a bug corrected in 0.19.0 and
> > >> >> > branch-0.18. I suggest you upgrade to that version if you don't want to
> > >> >> > break your MR jobs.
> > >> >> >
> > >> >> > J-D
> > >> >> >
> > >> >> > On Tue, Feb 24, 2009 at 10:33 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
> > >> >> >
> > >> >> >> What I see now is that the client gets an exception (see below) once a
> > >> >> >> region server stops:
> > >> >> >>
> > >> >> >> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
> > >> >> >> address listed in .META.
> > >> >> >> ...
> > >> >> >> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException:
> > >> >> >> Trying to contact region server <region server>:60020 for region
> > >> >> >>
> > >> >> >> I guess the exception occurred since the region server is down. Is it
> > >> >> >> correct?
> > >> >> >>
> > >> >> >> Thank you for your cooperation,
> > >> >> >> M.
> > >> >> >>
> > >> >> >> P. S. We are running version 0.18.1
> > >> >> >>
> > >> >> >> On Tue, Feb 24, 2009 at 5:07 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >> >> >> > Correcting myself, "no waiting time" refers only to the time to figure out
> > >> >> >> > the node is dead. It will still have to fetch the region location in META.
> > >> >> >> >
> > >> >> >> > J-D
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > On Tue, Feb 24, 2009 at 10:02 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >> >> >> >
> > >> >> >> >> Well if a region server dies instead of being cleanly shut down, it takes
> > >> >> >> >> in the worst case 180 seconds (a region server lease length) before the
> > >> >> >> >> Master reassigns the regions. Clients trying to connect to that server will
> > >> >> >> >> take IIRC 10 seconds to figure the node is down, then the time to communicate
> > >> >> >> >> with ROOT and META is under 1 sec. If META wasn't updated yet, it will retry
> > >> >> >> >> all of that.
> > >> >> >> >>
> > >> >> >> >> In the next release (0.20.0), the master is notified by Zookeeper within
> > >> >> >> >> seconds of a region server death and will proceed to reassign the
> > >> >> >> >> regions immediately.
> > >> >> >> >>
> > >> >> >> >> If the client doesn't have the region in cache and META is updated with the
> > >> >> >> >> region server death, there will be no waiting time.
> > >> >> >> >>
> > >> >> >> >> J-D
> > >> >> >> >>
> > >> >> >> >>
> > >> >> >> >> On Tue, Feb 24, 2009 at 9:49 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
> > >> >> >> >>
> > >> >> >> >>> Thanks, now it is clear.
> > >> >> >> >>>
> > >> >> >> >>> However, if a region server is down, it takes a lot of time to retry
> > >> >> >> >>> first, to rescan the META region when the retries fail, rescan ROOT, etc. to
> > >> >> >> >>> get eventually to another region server, which will handle the request.
> > >> >> >> >>> Is it correct ?
> > >> >> >> >>>
> > >> >> >> >>> On Tue, Feb 24, 2009 at 4:36 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >> >> >> >>> > This is why we have a META table, it holds the location info. See
> > >> >> >> >>> > http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture#client
> > >> >> >> >>> >
> > >> >> >> >>> > J-D
> > >> >> >> >>> >
> > >> >> >> >>> > On Tue, Feb 24, 2009 at 9:28 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
> > >> >> >> >>> >
> > >> >> >> >>> >> Thanks, Jean-Daniel.
> > >> >> >> >>> >>
> > >> >> >> >>> >> I did run hbase-daemon stop regionserver and start regionserver
> > >> >> >> >>> >> and saw the client retrying to connect to the restarted region server.
> > >> >> >> >>> >>
> > >> >> >> >>> >> How does it know to connect to another region server ? Maybe it stops
> > >> >> >> >>> >> retrying, asks master, and gets another region server to connect to.
> > >> >> >> >>> >> Is it correct ?
> > >> >> >> >>> >>
> > >> >> >> >>> >> Thank you for your cooperation,
> > >> >> >> >>> >> M.
> > >> >> >> >>> >>
> > >> >> >> >>> >> On Tue, Feb 24, 2009 at 3:56 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >> >> >> >>> >> > Michael,
> > >> >> >> >>> >> >
> > >> >> >> >>> >> > Regards stopping those nodes, do it using hadoop-daemon/hbase-daemon to stop
> > >> >> >> >>> >> > them cleanly. Requests from the clients will not "fail", they will simply be
> > >> >> >> >>> >> > told to look elsewhere for the regions they have in cache. Unless you only
> > >> >> >> >>> >> > have 1 region server...
> > >> >> >> >>> >> >
> > >> >> >> >>> >> > Regards starting the nodes, apart from the usual hadoop-daemon/hbase-daemon,
> > >> >> >> >>> >> > no.
> > >> >> >> >>> >> >
> > >> >> >> >>> >> > J-D
> > >> >> >> >>> >> >
> > >> >> >> >>> >> > On Tue, Feb 24, 2009 at 8:50 AM, Michael Dagaev <michael.dagaev@gmail.com> wrote:
> > >> >> >> >>> >> >
> > >> >> >> >>> >> >> Hi, all
> > >> >> >> >>> >> >>
> > >> >> >> >>> >> >>     As I understand, I can stop a region server and a data node in a
> > >> >> >> >>> >> >> cluster "semi-transparently" for clients, i.e. the requests handled by the
> > >> >> >> >>> >> >> region server at that time will fail, but the cluster will keep working.
> > >> >> >> >>> >> >>
> > >> >> >> >>> >> >> If I start the data node and region server I do not have to do anything to
> > >> >> >> >>> >> >> make them work.
> > >> >> >> >>> >> >>
> > >> >> >> >>> >> >> Is it correct ?
> > >> >> >> >>> >> >>
> > >> >> >> >>> >> >> Thank you for your cooperation,
> > >> >> >> >>> >> >> M.
> > >> >> >> >>> >> >>
> > >> >> >> >>> >> >
> > >> >> >> >>> >>
> > >> >> >> >>> >
> > >> >> >> >>>
> > >> >> >> >>
> > >> >> >> >>
> > >> >> >> >
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
> >
>
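
Editor's note: the timing J-D describes in the quoted thread (a worst-case 180-second lease before the master reassigns regions, roughly 10 seconds for a client to notice the dead server, and under 1 second to consult ROOT and META) can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration under simplifying assumptions that are not from the thread: retry rounds run back to back with no pause, the client never gives up, and the function names are hypothetical. The three constants are the figures quoted in the thread, which J-D gives from memory.

```python
import math

# Figures quoted in the thread for HBase 0.18.x (recalled "IIRC", not measured).
LEASE_SECONDS = 180   # worst-case region server lease before the master reassigns
DETECT_SECONDS = 10   # time for a client to decide the region server is down
LOOKUP_SECONDS = 1    # upper bound for consulting ROOT and META


def rounds_until_meta_updated(lease=LEASE_SECONDS,
                              detect=DETECT_SECONDS,
                              lookup=LOOKUP_SECONDS):
    """Count detect-and-lookup rounds until one ends after the lease expires.

    Assumption (not from the thread): rounds run back to back with no pause,
    and the client keeps retrying instead of giving up.
    """
    return math.ceil(lease / (detect + lookup))


def worst_case_delay(lease=LEASE_SECONDS,
                     detect=DETECT_SECONDS,
                     lookup=LOOKUP_SECONDS):
    """Seconds spent retrying before META can point at the new region server."""
    rounds = rounds_until_meta_updated(lease, detect, lookup)
    return rounds * (detect + lookup)


if __name__ == "__main__":
    print("rounds:", rounds_until_meta_updated())
    print("seconds:", worst_case_delay())
```

With the quoted figures this model gives 17 rounds and about 187 seconds. A real 0.18 client stops earlier, as the "failed after 10 attempts" error in the thread shows, so the number is an upper bound on waiting rather than a prediction of client behavior.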
