hbase-user mailing list archives

From Dejan Menges <dejan.men...@gmail.com>
Subject Re: Strange issue when DataNode goes down
Date Mon, 23 Mar 2015 14:59:40 GMT
Interesting discussion I also found; it sheds some more light on what
Nicolas is talking about: https://issues.apache.org/jira/browse/HDFS-3703

On Mon, Mar 23, 2015 at 3:53 PM Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> So is it safe to set hbase.lease.recovery.timeout lower if you also set
> heartbeat.recheck.interval lower (lowering that 10.5 min dead node timer)?
> Or is it recommended not to touch either of those?
>
> Reading the above with interest, thanks for digging in here guys.
>
> On Mon, Mar 23, 2015 at 10:13 AM, Nicolas Liochon <nkeywal@gmail.com> wrote:
>
> > If the node is actually down it's fine. But the node may not be that down
> > (CAP theorem here); and then it's looking for trouble.
> > HDFS, by default, declares a node as dead after 10:30. The 15 minutes is
> > an extra safety margin. It seems your hdfs settings are different (or
> > there is a bug...). There should be some info in the hdfs logs.
> >
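A note on where that 10:30 default comes from (the standard HDFS dead-node formula, assuming default settings):

    dead-node interval = 2 * dfs.namenode.heartbeat.recheck-interval
                       + 10 * dfs.heartbeat.interval
                       = 2 * 300s + 10 * 3s = 630s = 10min30s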
> > On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <dejan.menges@gmail.com> wrote:
> >
> > > Will take a look.
> > >
> > > Actually, if a node is down (someone unplugged the network cable, it
> > > just died, whatever), what's the chance it's going to come back so the
> > > write can continue? On the other side, HBase is not starting recovery;
> > > it keeps trying to contact the node which is not there anymore, and
> > > which was even declared dead on the Namenode side (another thing I
> > > didn't understand quite well).
> > >
> > > So what I was expecting is that as soon as the Namenode decided the
> > > node was dead, that would be enough for the RegionServer to stop trying
> > > to recover from the dead node, but it wasn't the case. Also, this whole
> > > MTTR article in the HBase book doesn't work at all with this parameter
> > > set to its default value (15 minutes).
> > >
> > > So I'm still struggling to figure out what exactly the drawback of
> > > lowering it is?
> > >
> > > On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <nkeywal@gmail.com> wrote:
> > >
> > > > Thanks for the explanation. There is an issue if you modify this
> > > > setting, however.
> > > > hbase tries to recover the lease (i.e. be sure that nobody is
> > > > writing). If you change hbase.lease.recovery.timeout, hbase will
> > > > start the recovery (i.e. start to read) even if it's not sure that
> > > > nobody's writing. That means there is a dataloss risk.
> > > > Basically, you must not see this warning: WARN
> > > > org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after
> > > > trying for[]
> > > >
> > > > The recoverLease must succeed. The fact that it does not after X
> > > > tries is strange.
> > > > There may be a mismatch between the hbase parameters and the hdfs
> > > > ones. You may need to have a look at the comments in FSHDFSUtils.java
> > > >
> > > > On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <dejan.menges@gmail.com> wrote:
> > > >
> > > > > I found the issue and fixed it, and will try to explain here what
> > > > > exactly happened in our case, in case someone else finds this
> > > > > interesting too.
> > > > >
> > > > > So initially, we had (a couple of times) some network and hardware
> > > > > issues in our datacenters. When one server died (literally died; we
> > > > > had some issues with PSUs) we saw problems with overall cluster
> > > > > performance on the HBase side. As the cluster is quite big and
> > > > > live, it was also quite hard to figure out the exact root cause and
> > > > > how to fix the things we wanted to fix.
> > > > >
> > > > > So I set up another cluster: four nodes (with DataNode and
> > > > > RegionServer) and two more nodes with HMaster and Namenode in HA,
> > > > > using the same stack we use in production. We pumped some data into
> > > > > it, and last week I was able to reproduce the same issue on it.
> > > > > What I did was cut off one server (shut down its interface) while
> > > > > the cluster was healthy and under load, and watch what happens.
> > > > >
> > > > > On Friday, after Nicolas mentioned it, I started looking at the
> > > > > HBase logs on the node which the HMaster log named as the one
> > > > > taking over regions for the dead server. Basically what I observed
> > > > > was a 15 minute gap (+- a couple of seconds, which was also
> > > > > interesting; I'll get to that later) between the moment HMaster
> > > > > figured out that one of its RegionServers was dead and the moment
> > > > > one of the mentioned nodes started taking over those regions and
> > > > > they started appearing in HMaster's Web UI.
> > > > >
> > > > > I then set up everything as described here
> > > > > http://hbase.apache.org/book.html#mttr - but still had exactly the
> > > > > same issue. Went over (again and again) everything currently
> > > > > configured, but still had the same issue.
> > > > >
> > > > > Then I started looking into HDFS. Opened the NameNode UI, saw all
> > > > > was good, took one node down while watching the RegionServer logs
> > > > > at the same time - and saw that it also took ~15 minutes for the
> > > > > Namenode to declare the node dead. Somehow at that same moment the
> > > > > regions started coming back to life. I remembered the dfs timeout
> > > > > checks and check retries from some older versions. Looked into the
> > > > > defaults for our Hadoop version -
> > > > > http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
> > > > > - and saw that there's no recheck parameter there anymore. Strange,
> > > > > as on StackOverflow I found a post from a month ago, for a newer
> > > > > version than ours (we use 2.4.1, the poster was using 2.6 -
> > > > > dfs.namenode.heartbeat.recheck-interval). I set it to 10 seconds as
> > > > > he mentioned, keeping checks every three seconds (the default), and
> > > > > got the DataNode declared dead in ~45 seconds, as he described. Not
> > > > > sure why this parameter is not documented, but it obviously works.
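A minimal hdfs-site.xml sketch of the change described above; the 10s value is the one from the StackOverflow post, and with the default 3s heartbeat the dead-node formula gives 2 * 10s + 10 * 3s = 50s, in line with the ~45s observed:

    <!-- hdfs-site.xml: shorten the Namenode's dead-node timer (sketch) -->
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>10000</value> <!-- milliseconds; default 300000 (5 min) -->
    </property>
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value> <!-- seconds; the default, shown for clarity -->
    </property>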
> > > > >
> > > > > Then I figured out it still didn't fix our HBase failover issue. I
> > > > > was looking into the HBase book again and again, and then saw this
> > > > > part:
> > > > >
> > > > > "How much time we allow elapse between calls to recover lease.
> > > > > Should be larger than the dfs timeout."
> > > > >
> > > > > This is the description of hbase.lease.recovery.dfs.timeout. It
> > > > > wasn't clear from that comment which of all the timeouts you can
> > > > > set in Hadoop/HBase that have something to do with DFS it is
> > > > > talking about. But I picked hbase.lease.recovery.timeout, and that
> > > > > was the catch.
> > > > >
> > > > > By default, hbase.lease.recovery.timeout is set to 15 minutes. Not
> > > > > sure why, and wasn't able to find out yet, but getting it down to
> > > > > one minute (which is more than OK for us) got rid of our issue.
> > > > > Also not sure why this is not mentioned in the MTTR section of the
> > > > > HBase book, as obviously MTTR doesn't work at all with this default
> > > > > timeout - at least it doesn't work the way we expected it to.
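A sketch of the corresponding hbase-site.xml change (one minute, as described above); keep in mind Nicolas's caveat above in this thread that lowering this carries a dataloss risk if lease recovery hasn't actually completed:

    <!-- hbase-site.xml: cap how long HBase waits for WAL lease recovery (sketch) -->
    <property>
      <name>hbase.lease.recovery.timeout</name>
      <value>60000</value> <!-- milliseconds; default 900000 (15 min) -->
    </property>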
> > > > >
> > > > > So thanks again to everyone being spammed with this, and special
> > > > > thanks to Nicolas for pointing me in the right direction.
> > > > >
> > > > > On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <nkeywal@gmail.com> wrote:
> > > > >
> > > > > > the attachments are rejected by the mailing list, can you put
> > > > > > them on pastebin?
> > > > > >
> > > > > > stale is mandatory (so it's good), but the issue here is just
> > > > > > before that. The region server needs to read the file. In order
> > > > > > to be sure that there is no data loss, it needs to "recover the
> > > > > > lease", which means ensuring that nobody is still writing the
> > > > > > file. The regionserver calls the namenode to do this
> > > > > > recoverLease. So there should be some info in the namenode logs.
> > > > > > Do you have HDFS-4721 on your hdfs? The hbase book details (more
> > > > > > or less...) this recoverLease stuff.
> > > > > >
> > > > > > On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <dejan.menges@gmail.com> wrote:
> > > > > >
> > > > > > > And also, just checked - dfs.namenode.avoid.read.stale.datanode
> > > > > > > and dfs.namenode.avoid.write.stale.datanode are both true, and
> > > > > > > dfs.namenode.stale.datanode.interval is set to the default
> > > > > > > 30000.
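For context, all three of those stale-datanode settings live in hdfs-site.xml; a sketch with the values quoted above:

    <!-- hdfs-site.xml: stale-datanode handling (sketch; values as quoted in this thread) -->
    <property>
      <name>dfs.namenode.avoid.read.stale.datanode</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.namenode.avoid.write.stale.datanode</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.namenode.stale.datanode.interval</name>
      <value>30000</value> <!-- milliseconds (30 s) -->
    </property>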
> > > > > > >
> > > > > > > On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <dejan.menges@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Nicolas,
> > > > > > > >
> > > > > > > > Please find the log attached.
> > > > > > > >
> > > > > > > > As I see it now more clearly, it was trying to recover HDFS
> > > > > > > > WALs from the node that's dead:
> > > > > > > >
> > > > > > > > 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms
> > > > > > > >
> > > > > > > > And as you can see from the log, it tried 40 times, which
> > > > > > > > took exactly 15 minutes.
> > > > > > > >
> > > > > > > > There's probably some parameter to tune to cut it down from
> > > > > > > > 40 tries / 15 minutes to something more useful, as for those
> > > > > > > > 15 minutes we don't have our regions available, even though
> > > > > > > > HDFS has replication factor 3.
> > > > > > > >
> > > > > > > > Googling; if I figure out what it is I will post it here.
> > > > > > > > Will also appreciate it if someone knows how to cut this
> > > > > > > > down.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Dejan
> > > > > > > >
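For what it's worth, the arithmetic in that WARN line is consistent: 908210 ms over 40 attempts is roughly 22.7 s between recoverLease calls, and the loop gives up once the 900000 ms (15 min) hbase.lease.recovery.timeout budget is exhausted. The per-attempt pacing comes from HBase's lease-recovery retry logic; see the comments in FSHDFSUtils.java that Nicolas points to elsewhere in this thread.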
> > > > > > > > On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <nkeywal@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > The split is done by the region servers (the master
> > > > > > > > > coordinates). Is there some interesting stuff in their logs?
> > > > > > > > >
> > > > > > > > > On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <dejan.menges@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > With the client, the issue was that it kept retrying
> > > > > > > > > > connections to the same region servers even when the
> > > > > > > > > > regions were reassigned. Lowering the retry count helped
> > > > > > > > > > in this specific use case, but yes, we still want servers
> > > > > > > > > > to reallocate quickly.
> > > > > > > > > >
> > > > > > > > > > What got me here:
> > > > > > > > > >
> > > > > > > > > > http://hbase.apache.org/book.html#mttr
> > > > > > > > > >
> > > > > > > > > > I basically set the configuration exactly the same way as
> > > > > > > > > > it's explained there. *zookeeper.session.timeout* is (and
> > > > > > > > > > was before) 60000 (one minute).
> > > > > > > > > >
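For reference, that setting lives in hbase-site.xml; a minimal sketch with the value quoted above (session expiry is what lets the master notice a dead RegionServer):

    <!-- hbase-site.xml: ZooKeeper session timeout (sketch) -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>60000</value> <!-- milliseconds -->
    </property>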
> > > > > > > > > > So basically what happens is:
> > > > > > > > > > - I have a test cluster consisting of four nodes, every
> > > > > > > > > > node being a DataNode and a RegionServer.
> > > > > > > > > > - I simulate a network partition on one of them (connect
> > > > > > > > > > to it through the console and take the network interface
> > > > > > > > > > down, simulating the network issues we had recently).
> > > > > > > > > > - After a short time I see in HBase that my RegionServer
> > > > > > > > > > is dead, and as the total number of regions I see the
> > > > > > > > > > previous total minus the number of regions that were
> > > > > > > > > > hosted on the RegionServer that just 'disappeared'.
> > > > > > > > > > - At this point I want my cluster to recover as soon as
> > > > > > > > > > possible, to start serving the missing regions.
> > > > > > > > > > - The first thing I see in the HMaster logs is:
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > >
> > > > > > > > > > In the meantime, in the hbase...out log I got this:
> > > > > > > > > >
> > > > > > > > > > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
> > > > > > > > > >
> > > > > > > > > > java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
> > > > > > > > > >     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
> > > > > > > > > >     at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
> > > > > > > > > >     at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
> > > > > > > > > >     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
> > > > > > > > > >     at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
> > > > > > > > > >     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> > > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> > > > > > > > > >     at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
> > > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > > >     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
> > > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > > >     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> > > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> > > > > > > > > >     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > > > > > > > >     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> > > > > > > > > >     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> > > > > > > > > >     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> > > > > > > > > >     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> > > > > > > > > >     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> > > > > > > > > >     at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > > > > > > >     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> > > > > > > > > >     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> > > > > > > > > >     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > > > > > > >     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > > > > > > >     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > > > > > > >     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> > > > > > > > > >     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > > > > > > > > > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
> > > > > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
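(As a reading of that stack trace: it is the HMaster web UI - table_jsp calling HBaseAdmin.getCompactionState - timing out against the unreachable RegionServer after the 60 s RPC timeout, i.e. a symptom of the node being gone rather than a separate failure.)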
> > > > > > > > > > Besides this same issue, please note that the first
> > > > > > > > > > message was at 2015-03-20 14:17:26,015. And then (here we
> > > > > > > > > > get to the point where it started the transition):
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
> > > > > > > > > >
> > > > > > > > > > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261
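Note the 909083ms in the "finished splitting" line: that is essentially the same 15 minutes as the 900000ms hbase.lease.recovery.timeout from the WARN quoted earlier in this thread. The log split, and with it region reassignment, cannot complete until lease recovery on the dead node's WAL either succeeds or gives up.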
> > > > > > > > > > At this point, note that it finished the SplitLogManager
> > > > > > > > > > task at 14:32:35 and started transitioning regions just
> > > > > > > > > > after that. So this is the 15 minutes I'm talking about.
> > > > > > > > > >
> > > > > > > > > > What am I missing?
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <nkeywal@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > You've changed the value of hbase.zookeeper.timeout to
> > > > > > > > > > > 15 minutes? A very reasonable target is 1 minute before
> > > > > > > > > > > relocating the regions. That's the default iirc. You
> > > > > > > > > > > can push it to 20s, but then gc-stopping-the-world
> > > > > > > > > > > becomes more of an issue. 15 minutes is really a lot.
> > > > > > > > > > > The hdfs stale mode must always be used, with a lower
> > > > > > > > > > > timeout than the hbase one.
> > > > > > > > > > >
> > > > > > > > > > > Client side there should be nothing to do (except for
> > > > > > > > > > > advanced stuff); at each retry the client checks the
> > > > > > > > > > > location of the regions. If you lower the number of
> > > > > > > > > > > retries the client will fail sooner, but usually you
> > > > > > > > > > > don't want the client to fail, you want the servers to
> > > > > > > > > > > reallocate quickly.
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <dejan.menges@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > Sorry for the slightly late update, but I managed to
> > > > > > > > > > > > narrow it down a little.
> > > > > > > > > > > >
> > > > > > > > > > > > We didn't update yet, as we are using the Hortonworks
> > > > > > > > > > > > distribution right now, and even if we update we will
> > > > > > > > > > > > get 0.98.4. However, it looks like the issue here was
> > > > > > > > > > > > in our use case and configuration (still looking into
> > > > > > > > > > > > it).
> > > > > > > > > > > >
> > > > > > > > > > > > Basically, initially I saw that when one server goes
> > > > > > > > > > > > down we start having performance issues in general,
> > > > > > > > > > > > but it turned out to be on our client side, due to
> > > > > > > > > > > > caching: clients were trying to reconnect to nodes
> > > > > > > > > > > > that were offline and later trying to get regions
> > > > > > > > > > > > from those nodes too. This is basically why, on the
> > > > > > > > > > > > server side, I didn't manage to see anything in the
> > > > > > > > > > > > logs that was even a little bit interesting or that
> > > > > > > > > > > > pointed me in the desired direction.
> > > > > > > > > > > >
> > > > > > > > > > > > Another question that popped up for me is - in case a
> > > > > > > > > > > > server is down (and with it the DataNode and
> > > > > > > > > > > > HRegionServer it was hosting) - what's the optimal
> > > > > > > > > > > > time to set for HMaster to consider the server dead
> > > > > > > > > > > > and reassign regions somewhere else, as this is
> > > > > > > > > > > > another performance bottleneck we hit while unable to
> > > > > > > > > > > > access regions? In our case it's configured to 15
> > > > > > > > > > > > minutes, and simple logic tells me that if you want it
> > > > > > > > > > > > sooner you configure a lower number of retries, but
> > > > > > > > > > > > the issue is, as always, in the details, so I'm not
> > > > > > > > > > > > sure if anyone knows some better math for this?
> > > > > > > > > > > >
> > > > > > > > > > > > And the last question - is it possible to manually
> > > > > > > > > > > > force HBase to reassign regions? In this case, while
> > > > > > > > > > > > HMaster is retrying to contact the node that's dead,
> > > > > > > > > > > > it's impossible to force it using the 'balancer'
> > > > > > > > > > > > command.
> > > > > > > > > > > >
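(On that last question: the hbase shell also has per-region 'assign', 'unassign' and 'move' commands that act independently of the balancer; whether they can make progress while the master is still splitting the dead RegionServer's logs is another matter, and this is a general pointer rather than something verified against 0.98.0.)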
> > > > > > > > > > > > Thanks a lot!
> > > > > > > > > > > >
> > > > > > > > > > > > Dejan
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <dejan.menges@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > To be very honest - there's no particular reason
> > > > > > > > > > > > > why we stick to this one, besides just a lack of
> > > > > > > > > > > > > time currently to go through the upgrade process,
> > > > > > > > > > > > > but it looks to me like that's going to be the next
> > > > > > > > > > > > > step.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Had a crazy day, didn't have time to go through all
> > > > > > > > > > > > > the logs again, plus one of the machines (the last
> > > > > > > > > > > > > one where we had this issue) was fully
> > > > > > > > > > > > > reprovisioned yesterday, so I don't have logs from
> > > > > > > > > > > > > there anymore.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Besides upgrading, which is what I will talk about
> > > > > > > > > > > > > today, can you just point me to the specific RPC
> > > > > > > > > > > > > issue in 0.98.0? The thing is that we have some
> > > > > > > > > > > > > strange moments with RPC in this case, and I just
> > > > > > > > > > > > > want to see if that's the same thing (we were even
> > > > > > > > > > > > > suspecting RPC).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks a lot!
> > > > > > > > > > > > > Dejan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <apurtell@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Is there a particular reason why you are using
> > > > > > > > > > > > > > HBase 0.98.0? The latest 0.98 release is 0.98.11.
> > > > > > > > > > > > > > There's a known performance issue with 0.98.0
> > > > > > > > > > > > > > pertaining to RPC that was fixed in later
> > > > > > > > > > > > > > releases; you should move up from 0.98.0. In
> > > > > > > > > > > > > > addition, hundreds of improvements and bug fixes
> > > > > > > > > > > > > > have gone into the ten releases since 0.98.0.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <dejan.menges@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We have a strange issue with HBase performance
> > > > > > > > > > > > > > > (overall cluster performance) in case one of
> > > > > > > > > > > > > > > the datanodes in the cluster unexpectedly goes
> > > > > > > > > > > > > > > down.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So the scenario is as follows:
> > > > > > > > > > > > > > > - The cluster works fine, it's stable.
> > > > > > > > > > > > > > > - One DataNode unexpectedly goes down (PSU
> > > > > > > > > > > > > > > issue, network issue, anything).
> > > > > > > > > > > > > > > - The whole HBase cluster goes down (performance
> > > > > > > > > > > > > > > becomes so bad that we have to restart all
> > > > > > > > > > > > > > > RegionServers to bring it back to life).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The funniest and latest occurrence was when we
> > > > > > > > > > > > > > > added a new node to the cluster (having 8 x 4T
> > > > > > > > > > > > > > > SATA disks) and left just the DataNode running
> > > > > > > > > > > > > > > on it, to give it a couple of days to pick up
> > > > > > > > > > > > > > > some data. At some point, due to a hardware
> > > > > > > > > > > > > > > issue, the server rebooted (twice during three
> > > > > > > > > > > > > > > hours) at a moment when it had maybe 5% of the
> > > > > > > > > > > > > > > data it would have in a couple of days. Nothing
> > > > > > > > > > > > > > > else besides the DataNode was running on it,
> > > > > > > > > > > > > > > and once it went down, it affected literally
> > > > > > > > > > > > > > > everything; restarting the RegionServers in the
> > > > > > > > > > > > > > > end fixed it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We are using HBase 0.98.0 with Hadoop 2.4.0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - Andy
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Problems worthy of attack prove their worth by
> > > > > > > > > > > > > > hitting back. - Piet Hein (via Tom White)
