hbase-user mailing list archives

From Amit Mor <amit.mor.m...@gmail.com>
Subject Re: RS crash upon replication
Date Wed, 22 May 2013 21:22:06 GMT
 va-p-hbase-02-d,60020,1369249862401


On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <varun@pinterest.com> wrote:

> Basically:
>
> ls /hbase/rs - what do you see for va-p-02-d ?
>
>
> On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <varun@pinterest.com> wrote:
>
> > Can you do ls /hbase/rs and see what you get for 02-d? Instead of looking
> > in /replication/, could you look in /hbase/replication/rs? I want to see
> > whether the timestamps match.
> >
> > Varun
> >
> >
> > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <varun@pinterest.com> wrote:
> >
> >> I see - so it looks okay - there's just a lot of deep nesting in there. If
> >> you look into these nodes by doing ls, you should see a bunch of
> >> WAL(s) which still need to be replicated...
> >>
> >> Varun
> >>
> >>
> >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <varun@pinterest.com> wrote:
> >>
> >>> 2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for
> >>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >>>
> >>> 01->[01->02->02]->01
> >>>
> >>> Looks like a bunch of cascading failures causing this deep nesting...
> >>>
> >>>
> >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.mail@gmail.com> wrote:
> >>>
> >>>> empty return:
> >>>>
> >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >>>> []
> >>>>
> >>>>
> >>>>
> >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com> wrote:
> >>>>
> >>>> > Do an "ls" not a get here and give the output ?
> >>>> >
> >>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >>>> >
> >>>> >
> >>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> >>>> >
> >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >>>> > >
> >>>> > > cZxid = 0x60281c1de
> >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
> >>>> > > mZxid = 0x60281c1de
> >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
> >>>> > > pZxid = 0x60281c1de
> >>>> > > cversion = 0
> >>>> > > dataVersion = 0
> >>>> > > aclVersion = 0
> >>>> > > ephemeralOwner = 0x0
> >>>> > > dataLength = 0
> >>>> > > numChildren = 0
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>> > >
> >>>> > > > What does this command show you ?
> >>>> > > >
> >>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >>>> > > >
> >>>> > > > Cheers
> >>>> > > >
> >>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> >>>> > > >
> >>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >>>> > > > > [1]
> >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >>>> > > > > []
> >>>> > > > >
> >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
> >>>> > > > >
> >>>> > > > > Thanks
> >>>> > > > >
> >>>> > > > >
> >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
> >>>> > > > >
> >>>> > > > > > Also what version of HBase are you running?
> >>>> > > > > >
> >>>> > > > > >
> >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <varun@pinterest.com> wrote:
> >>>> > > > > >
> >>>> > > > > > > Basically,
> >>>> > > > > > >
> >>>> > > > > > > You had va-p-hbase-02 crash. That caused all of its replication-related
> >>>> > > > > > > data in ZooKeeper to be moved to va-p-hbase-01, which took over
> >>>> > > > > > > replicating 02's logs. Each region server also maintains an in-memory
> >>>> > > > > > > copy of what's in ZK. It seems that when you start up 01, it tries to
> >>>> > > > > > > replicate the 02 logs underneath but fails because that data is
> >>>> > > > > > > not in ZK. This is somewhat weird...
> >>>> > > > > > >
> >>>> > > > > > > Can you open the ZooKeeper shell and do
> >>>> > > > > > >
> >>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >>>> > > > > > >
> >>>> > > > > > > And give the output ?
> >>>> > > > > > >
> >>>> > > > > > >
> >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.mail@gmail.com <amit.mor.mail@gmail.com> wrote:
> >>>> > > > > > >
> >>>> > > > > > >> Hi,
> >>>> > > > > > >>
> >>>> > > > > > >> This is bad ... and happened twice: I had taken my replication-slave
> >>>> > > > > > >> cluster offline. I performed quite a massive merge operation on it;
> >>>> > > > > > >> after a couple of hours it finished and I brought the cluster back
> >>>> > > > > > >> online. At the same time, the replication-master RS machines crashed
> >>>> > > > > > >> (see first crash http://pastebin.com/1msNZ2tH) with the first
> >>>> > > > > > >> exception being:
> >>>> > > > > > >>
> >>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> >>>> > > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>> > > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> >>>> > > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> >>>> > > > > > >>
> >>>> > > > > > >> Before restarting the crashed RSs, I issued a 'stop_replication'
> >>>> > > > > > >> command, then fired up the RSs again. They started OK, but once I
> >>>> > > > > > >> hit 'start_replication' they crashed again. The second crash log
> >>>> > > > > > >> http://pastebin.com/8Nb5epJJ has the same initial exception
> >>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
> >>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crashed region servers
> >>>> > > > > > >> again without replication and currently all is well, but I need to
> >>>> > > > > > >> start replication asap.
> >>>> > > > > > >>
> >>>> > > > > > >> Does anyone have an idea what's going on and how I can solve it?
> >>>> > > > > > >>
> >>>> > > > > > >> Thanks,
> >>>> > > > > > >> Amit
> >>>> > > > > > >>
> >>>> > > > > > >
> >>>> > > > > > >
> >>>> > > > > >
> >>>> > > > >
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>
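
The recovered-queue znode names quoted in this thread appear to consist of a peer id followed by the names of the region servers the queue has passed through, each in host,port,startcode form. As a rough sketch under that assumption (the layout is inferred from the paths quoted above, not taken from the HBase source, and the helper names parse_recovered_queue and restarted_servers are hypothetical), the following Python splits such a name into its chain and flags entries whose start code differs from the currently registered instance of the same host and port. That is roughly the timestamp comparison Varun asks for.

```python
import re

# Assumed layout (inferred from the thread, not from HBase source):
#   <peer_id>-<host,port,startcode>-<host,port,startcode>-...
# Hostnames contain hyphens, so we anchor on the comma-separated fields
# instead of splitting on "-".
SERVER_RE = re.compile(r"([^,]+),(\d+),(\d+)(?:-|$)")


def parse_recovered_queue(znode_name):
    """Split a queue znode name into (peer_id, [(host, port, startcode), ...])."""
    peer_id, _, rest = znode_name.partition("-")
    chain = [(h, int(p), int(s)) for h, p, s in SERVER_RE.findall(rest)]
    return peer_id, chain


def restarted_servers(chain, live_servers):
    """Return chain entries whose start code differs from the live instance
    registered for the same host and port."""
    live = {(h, p): s for h, p, s in live_servers}
    return [
        (h, p, s)
        for h, p, s in chain
        if (h, p) in live and live[(h, p)] != s
    ]


# Queue name taken from the SessionExpiredException path quoted above.
queue = ("1-va-p-hbase-01-c,60020,1369042378287"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")
peer_id, chain = parse_recovered_queue(queue)

# The entry reported for 02-d under /hbase/rs in the reply at the top.
_, live_servers = parse_recovered_queue("x-va-p-hbase-02-d,60020,1369249862401")

for host, port, startcode in restarted_servers(chain, live_servers):
    print(f"{host}:{port} queue start code {startcode} differs from live instance")
```

A mismatch here only shows that the host has restarted since the queue entry was written; it is not by itself a diagnosis of the NoNode crash.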
