hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: region servers dying - flush request - YCSB
Date Thu, 10 Mar 2011 22:48:10 GMT
That looks like someone trying to connect to the master but they are
not doing the handshake properly.  Do you have old versions of hbase
around the place? Or some other process connecting to the HBase Master?
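
If it helps, a quick way to rule out a version or handshake mismatch from a
given box is to run a tiny client against the master using the exact client
jars you expect to be in use. A rough, untested sketch (assuming the 0.90
client API and an hbase-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CheckMaster {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath, so it points at the same
    // master/ZooKeeper quorum the regionservers use.
    Configuration conf = HBaseConfiguration.create();
    // Throws MasterNotRunningException if the master can't be reached
    // (or can't be spoken to properly from this client).
    HBaseAdmin.checkHBaseAvailable(conf);
    System.out.println("Master reachable from this client");
  }
}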

As to unresponsive for 100 seconds, what was going on on your cluster?
 Any clues in Master log or in any of the regionserver logs?

St.Ack

2011/3/10 M.Deniz OKTAR <deniz.oktar@gmail.com>:
> Hi,
>
> Still working on the issue. This is one of the last trials I am doing before
> ordering a new cluster.
>
> I was going through the Yahoo benchmark again and HBase became
> non-responsive for a long time (about 100 secs); the benchmark results
> showed 0 throughput for that period and eventually the benchmark failed.
>
> I didn't get any exceptions except this one, on the Master node.
> iletken-test-0 is also the Master node, so it was trying to recover a file
> from the same node.
>
> Any suggestions?
>
> Thanks.
>
> ---
>
>
>   org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3
> of 14:
> hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383,
> length=86340652
>
> 2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils:
> Recovering file
> hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383
>
> 2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> listener on 60000: readAndProcess threw exception java.io.IOException:
> Connection reset by peer. Count of bytes read: 0
>
> java.io.IOException: Connection reset by peer
>
> at sun.nio.ch.FileDispatcher.read0(Native Method)
>
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>
>
>
> --
> deniz
>
> On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> This is a JVM error, and there seems to be a lot of them in the recent
>> versions. I personally recommend using u16 or u17.
>>
>> J-D
>>
>> On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com>
>> wrote:
>> > I don't know if it's related but I've seen a dead regionserver a little
>> > while ago too. But in our case the .out file showed a JVM crash along
>> > with a hs_err dump in the hbase home directory (attached below). We were
>> > running 0.90.0 at the time and we upgraded to 0.90.1 in hopes of a fix
>> > but nothing changed.
>> >
>> > The crash happened when we ran a RowCounter job, with 12 simultaneous
>> > tasks on 11 machines, 132 simultaneous tasks total.  The table has ~100k
>> > rows with ~700kB per row. We were trying different max_region_size
>> > values, and that instance had 100M. We have ~1000 regions total. Our
>> > machines have 24G RAM and ganglia shows no resource starvation nor
>> > swapping.
>> >
>> > These happened about a week ago, but I wasn't able to test further so I
>> > delayed asking here; if it has any relation to the problem Deniz is
>> > having, this log might be useful.
>> >
>> > #
>> > # A fatal error has been detected by the Java Runtime Environment:
>> > #
>> > #  SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
>> > #
>> > # JRE version: 6.0_23-b05
>> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode
>> > linux-amd64 compressed oops)
>> > # Problematic frame:
>> > # V  [libjvm.so+0x30325b]
>> > #
>> > # If you would like to submit a bug report, please visit:
>> > #   http://java.sun.com/webapps/bugreport/crash.jsp
>> > #
>> >
>> > ---------------  T H R E A D  ---------------
>> >
>> > Current thread (0x000000004013d800):  ConcurrentGCThread [stack:
>> > 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
>> >
>> > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR),
>> > si_addr=0x0000000000000018
>> >
>> > Registers:
>> > RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003,
>> > RDX=0x0000000000000000
>> > RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0,
>> > RDI=0x0000000000000010
>> > R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754,
>> > R11=0x00000000000209bc
>> > R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0,
>> > R15=0x00007fd01dbe5ab0
>> > RIP=0x00007fd02e23825b, EFL=0x0000000000010246,
>> CSGSFS=0x0000000000000033,
>> > ERR=0x0000000000000004
>> >  TRAPNO=0x000000000000000e
>> >
>> > Register to memory mapping:
>> >
>> > RAX=0x000000004013cbd8
>> > 0x000000004013cbd8 is pointing to unknown location
>> >
>> > RBX=0x00007fd02e8c6960
>> > 0x00007fd02e8c6960: <offset 0x991960> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > RCX=0x0000000000000003
>> > 0x0000000000000003 is pointing to unknown location
>> >
>> > RDX=0x0000000000000000
>> > 0x0000000000000000 is pointing to unknown location
>> >
>> > RSP=0x00007fd01dbe58c0
>> > 0x00007fd01dbe58c0 is pointing to unknown location
>> >
>> > RBP=0x00007fd01dbe58e0
>> > 0x00007fd01dbe58e0 is pointing to unknown location
>> >
>> > RSI=0x00007fd02e8aa9b0
>> > 0x00007fd02e8aa9b0: <offset 0x9759b0> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > RDI=0x0000000000000010
>> > 0x0000000000000010 is pointing to unknown location
>> >
>> > R8 =0x00000000175f6400
>> > 0x00000000175f6400 is pointing to unknown location
>> >
>> > R9 =0x000000000000000c
>> > 0x000000000000000c is pointing to unknown location
>> >
>> > R10=0x00007fd02e8aa754
>> > 0x00007fd02e8aa754: <offset 0x975754> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > R11=0x00000000000209bc
>> > 0x00000000000209bc is pointing to unknown location
>> >
>> > R12=0x00007fd01dbe5a00
>> > 0x00007fd01dbe5a00 is pointing to unknown location
>> >
>> > R13=0x00000006c3272000
>> >
>> >
>> > On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <cjjvictory@gmail.com> wrote:
>> >
>> >> Had the HTable already been disabled when you hit Ctrl-C?
>> >>
>> >> 2011/3/8, M.Deniz OKTAR <deniz.oktar@gmail.com>:
>> >> > Something new came up!
>> >> >
>> >> > I tried to truncate the 'usertable' which had ~12M entries.
>> >> >
>> >> > The shell stayed at "disabling table" for a long time. The processes
>> >> > were there but there were no requests, so I quit by hitting Ctrl-C.
>> >> >
>> >> > Then I tried count 'usertable' to see if the data remained; the shell
>> >> > gave an error and one of the regionservers had a log such as the one
>> >> > below.
>> >> >
>> >> > The master logs were also similar (I tried to disable again, and the
>> >> > master log below is from that trial).
>> >> >
>> >> >
>> >> > Regionserver 2:
>> >> >
>> >> > 2011-03-08 16:47:24,852 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> > NotServingRegionException; Region is not online:
>> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> >> > 2011-03-08 16:47:27,765 DEBUG
>> >> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63
>> >> MB,
>> >> > free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
>> >> > hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
>> >> > cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
>> >> > 2011-03-08 16:47:28,863 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> > NotServingRegionException; Region is not online:
>> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> >> > 2011-03-08 16:47:28,865 ERROR
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>> >> >         at
>> >> >
>> >>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>> >> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown
>> Source)
>> >> >         at
>> >> >
>> >>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >> >         at
>> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >> >         at
>> >> >
>> >>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
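
For what it's worth, count in the shell is just a client-side scan. A rough,
untested sketch of the same thing from Java (0.90 client API assumed), which
would presumably run into the same NotServingRegionException /
UnknownScannerException while the regions are caught mid-disable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CountRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "usertable");
    Scan scan = new Scan();
    scan.setCaching(1000);  // fetch rows in batches so next() round-trips are cheap
    ResultScanner scanner = table.getScanner(scan);
    long count = 0;
    try {
      while (scanner.next() != null) {
        count++;
      }
    } finally {
      scanner.close();
    }
    System.out.println("rows: " + count);
    table.close();
  }
}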
>> >> >
>> >> >
>> >> >
>> >> > Masterserver:
>> >> > .
>> >> > .
>> >> > . (same thing)
>> >> > 2011-03-08 16:51:34,679 INFO
>> >> > org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>> >> > PENDING_CLOSE for too long, running forced unassign again on
>> >> >
>> >>
>> region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
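
As an aside, truncate in the shell is just disable + drop + re-create. If it
keeps wedging, something like the following rough, untested sketch (0.90
client API; the 'family' column family name is an assumption, use whatever
the YCSB table was created with) lets you see exactly which step hangs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ManualTruncate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.disableTable("usertable");  // roughly where the shell sat at "disabling table"
    while (!admin.isTableDisabled("usertable")) {
      // If this loops forever, look for regions stuck in PENDING_CLOSE in
      // the master's AssignmentManager log, like the one above.
      System.out.println("still disabling...");
      Thread.sleep(5000);
    }
    admin.deleteTable("usertable");
    // Re-create with the column family the benchmark expects ("family" is
    // an assumption here).
    HTableDescriptor desc = new HTableDescriptor("usertable");
    desc.addFamily(new HColumnDescriptor("family"));
    admin.createTable(desc);
  }
}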
>> >> >
>> >> >
>> >> > --
>> >> > deniz
>> >> >
>> >> >
>> >> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <deniz.oktar@gmail.com>
>> >> wrote:
>> >> >
>> >> >> Hi all,
>> >> >>
>> >> >> Thanks for the support. I've been trying to replicate the problem
>> >> >> since this morning. Before doing that, I played with the
>> >> >> configuration. I used to have only one user and set all the
>> >> >> permissions according to that. Now I've followed the Cloudera manuals
>> >> >> and set permissions for the hdfs and mapred users. (changed
>> >> >> hbase-env.sh)
>> >> >>
>> >> >> I had 2 trials; on both, the Yahoo test failed because of receiving
>> >> >> lots of "0"s, but the region servers didn't die. At some points in
>> >> >> the test (also when it failed), the hbase master gave exceptions
>> >> >> about not being able to reach one of the servers. I also lost the ssh
>> >> >> connection to that server, but after a while it recovered. (also
>> >> >> hmaster) The last thing in the regionserver logs was that it was
>> >> >> going for a flush.
>> >> >>
>> >> >> I'll be going over the tests again and provide you with clean log
>> >> >> files from all servers. (hadoop, hbase, namenode, masternode logs)
>> >> >>
>> >> >> If you have any suggestions or directions for me to better diagnose
>> >> >> the problem, that would be lovely.
>> >> >>
>> >> >> btw: these servers do not have ECC memory but I do not see any
>> >> >> corruption in the data.
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> --
>> >> >> deniz
>> >> >>
>> >> >>
>> >> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans
>> >> >> <jdcryans@apache.org> wrote:
>> >> >>
>> >> >>> Along with a bigger portion of the log, it might be good to check
>> >> >>> if there's anything in the .out file that looks like a JVM error.
>> >> >>>
>> >> >>> J-D
>> >> >>>
>> >> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <deniz.oktar@gmail.com>
>> >> >>> wrote:
>> >> >>> > I ran every kind of benchmark I could find on those machines and
>> >> >>> > they seemed to work fine. Did memory/disk tests too.
>> >> >>> >
>> >> >>> > The master node and the other nodes log some information and
>> >> >>> > exceptions about not being able to reach the dead node.
>> >> >>> >
>> >> >>> > Btw sometimes the process does not die but loses the connection.
>> >> >>> >
>> >> >>> > --
>> >> >>> >
>> >> >>> > deniz
>> >> >>> >
>> >> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <stack@duboce.net> wrote:
>> >> >>> >
>> >> >>> >> I'm stumped.  I have nothing to go on when there are no death
>> >> >>> >> throes or complaints.  This hardware for sure is healthy?  Other
>> >> >>> >> stuff runs w/o issue?
>> >> >>> >> St.Ack
>> >> >>> >>
>> >> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <deniz.oktar@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >> > I don't know if it's normal but I see a lot of '0's in the
>> >> >>> >> > test results when it tends to fail, such as:
>> >> >>> >> >
>> >> >>> >> >  1196 sec: 7394901 operations; 0 current ops/sec;
>> >> >>> >> >
>> >> >>> >> > --
>> >> >>> >> > deniz
>> >> >>> >> >
>> >> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <deniz.oktar@gmail.com>
>> >> >>> >> > wrote:
>> >> >>> >> >
>> >> >>> >> >> Hi,
>> >> >>> >> >>
>> >> >>> >> >> Thanks for the effort, answers below:
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <stack@duboce.net>
>> >> >>> >> >> wrote:
>> >> >>> >> >>
>> >> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR
>> >> >>> >> >>> <deniz.oktar@gmail.com> wrote:
>> >> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I
>> >> >>> >> >>> > am running a custom workload with YCSB and when the data is
>> >> >>> >> >>> > loading (heavy insert) at least one of the region servers is
>> >> >>> >> >>> > dying after about 600000 operations.
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >>> Tell us the character of your 'custom workload' please.
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >> The workload is below; the part that fails is the loading
>> >> >>> >> >> part (-load), which inserts all the records first.
>> >> >>> >> >>
>> >> >>> >> >> recordcount=10000000
>> >> >>> >> >> operationcount=3000000
>> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >> >>> >> >>
>> >> >>> >> >> readallfields=true
>> >> >>> >> >>
>> >> >>> >> >> readproportion=0.5
>> >> >>> >> >> updateproportion=0.1
>> >> >>> >> >> scanproportion=0
>> >> >>> >> >> insertproportion=0.35
>> >> >>> >> >> readmodifywriteproportion=0.05
>> >> >>> >> >>
>> >> >>> >> >> requestdistribution=zipfian
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
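
For reference, the -load phase is essentially nothing but buffered inserts of
recordcount rows. A rough Java sketch of the kind of work it puts on the
region servers (untested; the "family" column family, the field0..field9
layout and the 100-byte field size are assumptions about YCSB defaults, all
of them configurable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "usertable");
    table.setAutoFlush(false);                 // client-side write buffering
    byte[] family = Bytes.toBytes("family");   // assumption: -p columnfamily=family
    byte[] value = new byte[100];              // assumption: default field length
    for (long i = 0; i < 10000000L; i++) {     // recordcount=10000000 from above
      Put put = new Put(Bytes.toBytes("user" + i));
      for (int f = 0; f < 10; f++) {           // assumption: default field count
        put.add(family, Bytes.toBytes("field" + f), value);
      }
      table.put(put);                          // buffered; flushed as the buffer fills
    }
    table.flushCommits();
    table.close();
  }
}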
>> >> >>> >> >>>
>> >> >>> >> >>> > There are no abnormalities in the logs as far as I can
>> >> >>> >> >>> > see; the only common point is that all of them (in
>> >> >>> >> >>> > different trials, different region servers fail) request a
>> >> >>> >> >>> > flush as the last log entries, given below. .out files are
>> >> >>> >> >>> > empty. I am looking at the /var/log/hbase folder for logs.
>> >> >>> >> >>> > Running the latest Sun Java 6. I couldn't find any logs
>> >> >>> >> >>> > that indicate a problem with Java. Tried the tests with
>> >> >>> >> >>> > OpenJDK and had the same results.
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> >>> It's strange that flush is the last thing in your log.  The
>> >> >>> >> >>> process is dead?  We are exiting w/o a note in the logs?
>> >> >>> >> >>> That's unusual.  We usually scream loudly when dying.
>> >> >>> >> >>>
>> >> >>> >> >>
>> >> >>> >> >> Yes, that's the strange part. The last line is a flush as if
>> >> >>> >> >> the process never failed. Yes, the process is dead and hbase
>> >> >>> >> >> cannot see the node.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>>
>> >> >>> >> >>> > I have set ulimits (50000) and xceivers (20000) for
>> >> >>> >> >>> > multiple users and am certain that they are correct.
>> >> >>> >> >>>
>> >> >>> >> >>> The first line in an hbase log prints out the ulimit it
>> >> >>> >> >>> sees.  You might check that the hbase process for sure is
>> >> >>> >> >>> picking up your ulimit setting.
>> >> >>> >> >>>
>> >> >>> >> >> That was a mistake I made a couple of days ago; I checked it
>> >> >>> >> >> with cat /proc/<pid of regionserver>/limits and all related
>> >> >>> >> >> users like 'hbase' have those limits. Checked the logs:
>> >> >>> >> >>
>> >> >>> >> >> Mon Mar  7 06:41:15 EET 2011 Starting regionserver on test-1
>> >> >>> >> >> ulimit -n 52768
>> >> >>> >> >>
>> >> >>> >> >>>
>> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> >>> (The mystery compounds)
>> >> >>> >> >>>
>> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> >> >>> Compaction
>> >> >>> >> >>> > requested for
>> >> >>> >> >>> >
>> >> >>> >>
>> >> >>>
>> >> usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
>> >> >>> >> >>> > because regionserver60020.cacheFlusher;
priority=3,
>> compaction
>> >> >>> queue
>> >> >>> >> >>> size=18
>> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion:
>> >> >>> >> >>> > NOT flushing memstore for region
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >>
>> >> >>>
>> >>
>> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> >> >>> >> >>> > flushing=false, writesEnabled=false
>> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion:
>> >> >>> >> >>> > Started memstore flush for
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >>
>> >> >>>
>> >>
>> usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> >> >>> >> >>> > current region memstore size 68.6m
>> >> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG
>> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion:
>> >> >>> >> >>> > Flush requested on
>> >> >>> >> >>> >
>> >> >>> >>
>> >> >>>
>> >> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >> >>> >> >>> > -end of log file-
>> >> >>> >> >>> > ---
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> >>> Nothing more?
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >> No, nothing after that. But there are quite a lot of logs
>> >> >>> >> >> before that; I can send them if you'd like.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>> Thanks,
>> >> >>> >> >>> St.Ack
>> >> >>> >> >>>
>> >> >>> >> >>
>> >> >>> >> >> Thanks a lot!
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >>
>> >> >>> >
>> >> >>>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Sent from my mobile device
>> >>
>> >> Thanks & Best regards
>> >> jiajun
>> >>
>> >
>> >
>> >
>> > --
>> > erdem agaoglu
>> >
>>
>
