hbase-user mailing list archives

From "Adrià Vilà" <av...@datknosys.com>
Subject Re: RegionServers shutdown randomly
Date Mon, 10 Aug 2015 17:16:28 GMT
Got the servers to stay up, but I think the CORRUPT filesystem had nothing to do with it, for the reasons below.
  
  Today the master's boot disk wouldn't start because of a kernel panic:
 VFS: Cannot open root device "UUID=6c782ac9-0050-4837-b65a-77cb8a390772" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
  
 What I did was restore a backup from a few days ago (as this is a testing cluster) and, following some advice, apply the following ulimit change there (because my open files limit was 1024):
 vi /etc/security/limits.conf 
 added:
 * hard nofile 10000
 * soft nofile 10000
 root hard nofile 10000
 root soft nofile 10000
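  
 (Note: limits.conf is only read when a new login session starts, so the HBase daemons had to be restarted from a fresh session to pick up the new value. A quick check, run as the user that starts the RegionServer:)
 ulimit -Sn   # soft open-files limit, should now show 10000 instead of 1024
 ulimit -Hn   # hard open-files limit, should also show 10000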
  
 I'm sorry I can't say for sure what caused the RegionServers to crash. It could be the ulimit bottleneck, or the restored snapshot may have solved it :S
  
 [ The snapshot I recovered still has the HDFS system corrupted, so I guess that wasn't it
]
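  
 (For reference, whatever is still reported corrupt can be listed explicitly with standard fsck options, e.g.:)
 hdfs fsck / -list-corruptfileblocks    # only the paths with corrupt/missing blocks
 hdfs fsck /apps/hbase/data/WALs -files -blocks -locations    # limit the check to the WAL dirs seen below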

  
 ... thank you all!!
  
 Desde: "Ted Yu" <yuzhihong@gmail.com>
 Enviado: sábado, 08 de agosto de 2015 5:54
Para: "user@hbase.apache.org" <user@hbase.apache.org>
Asunto: Re: RegionServers shutdown randomly   
From what I heard, the reporting of CORRUPT for WAL-related files was a false alarm.

There is no evidence that hbase 1.1 produces corrupt WAL files.

Cheers

On Fri, Aug 7, 2015 at 7:59 PM, James Estes <james.estes@gmail.com> wrote:

> There is this
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201507.mbox/%3CCAE8tVdmyUfG%2BajK0gvMG_tLjoStZ0HjrQxJuuJzQ3Z%2B4vbzSuQ%40mail.gmail.com%3E
> Which points to
> https://issues.apache.org/jira/browse/HDFS-8809
>
> But (at least for us) this hasn't led to region server
> crashing...though I'm definitely interested in what issues it may be
> able to cause.
>
> James
>
>
> On Fri, Aug 7, 2015 at 11:05 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > Some WAL related files were marked corrupt.
> >
> > Can you try repairing them?
> >
> > Please check namenode log.
> > Search HDFS JIRA for any pending fix - I haven't tracked HDFS movement
> > closely recently.
> >
> > Thanks
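> >
> > (One possible repair route, assuming the flagged files are just WALs left
> > open by the crashed servers; <wal-path> below is a placeholder for one of
> > the paths fsck reports:)
> >
> > hdfs debug recoverLease -path <wal-path> -retries 3   # force-close a file still open for write (Hadoop 2.7+)
> > hdfs fsck /apps/hbase/data/WALs -move                 # last resort: move files with truly lost blocks to /lost+found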
> >
> > On Fri, Aug 7, 2015 at 7:54 AM, Adrià Vilà <avila@datknosys.com> wrote:
> >
> >> About the logs attached in this conversation: only the w-0 and w-1 nodes had
> >> failed, first w-0 and then w-1.
> >> 10.240.187.182 = w-2
> >> w-0 internal IP address is 10.240.164.0
> >> w-1 IP is 10.240.2.235
> >> m IP is 10.240.200.196
> >>
> >> FSCK (hadoop fsck / | egrep -v '^\.+$' | grep -v eplica) output:
> >> -
> >> Connecting to namenode via
> >> http://hdp-m.c.dks-hadoop.internal:50070/fsck?ugi=root&path=%2F FSCK
> >> started by root (auth:SIMPLE) from /10.240.200.196 for path / at Fri
> Aug
> >> 07 14:51:22 UTC 2015
> >>
> /apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438946915810-splitting/hdp-w-0.c.dks-hadoop.internal%2C1602
> >> 0%2C1438946915810..meta.1438950914376.meta: MISSING 1 blocks of total
> size
> >> 90 B......
> >>
> /apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438959061234/hdp-w-1.c.dks-hadoop.internal%2C16020%2C143895
> >> 9061234.default.1438959069800: MISSING 1 blocks of total size 90 B...
> >>
> /apps/hbase/data/WALs/hdp-w-2.c.dks-hadoop.internal,16020,1438959056208/hdp-w-2.c.dks-hadoop.internal%2C16020%2C143895
> >> 9056208..meta.1438959068352.meta: MISSING 1 blocks of total size 90 B.
> >>
> /apps/hbase/data/WALs/hdp-w-2.c.dks-hadoop.internal,16020,1438959056208/hdp-w-2.c.dks-hadoop.internal%2C16020%2C143895
> >> 9056208.default.1438959061922: MISSING 1 blocks of total size 90
> >> B...........................
> >>
> >> .........Status: CORRUPT
> >> Total size: 54919712019 B (Total open files size: 360 B)
> >> Total dirs: 1709 Total files: 2628
> >> Total symlinks: 0 (Files currently being written: 6)
> >> Total blocks (validated): 2692 (avg. block size 20401081 B) (Total open
> >> file blocks (not validated): 4)
> >> ********************************
> >> UNDER MIN REPL'D BLOCKS: 4 (0.1485884 %)
> >> CORRUPT FILES: 4
> >> MISSING BLOCKS: 4
> >> MISSING SIZE: 360 B
> >> ********************************
> >> Corrupt blocks: 0
> >> Number of data-nodes: 4
> >> Number of racks: 1
> >> FSCK ended at Fri Aug 07 14:51:26 UTC 2015 in 4511 milliseconds
> >>
> >> The filesystem under path '/' is CORRUPT
> >> -
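> >>
> >> (Side note: the MISSING SIZE above, 360 B, matches the reported "Total open
> >> files size", so the four flagged blocks may simply be the in-progress last
> >> blocks of WALs still open for write. A quick way to confirm, using standard
> >> fsck options:)
> >>
> >> hdfs fsck /apps/hbase/data/WALs -openforwrite -files -blocks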
> >>
> >> Thank you for your time.
> >>
> >> *From*: "Ted Yu" <yuzhihong@gmail.com>
> >> *Sent*: Friday, August 7, 2015 16:07
> >> *To*: "user@hbase.apache.org" <user@hbase.apache.org>,
> >> avila@datknosys.com
> >> *Subject*: Re: RegionServers shutdown randomly
> >>
> >> Does 10.240.187.182 <http://10.240.187.182:50010/> correspond to w-0
> >> or m?
> >>
> >> Looks like hdfs was intermittently unstable.
> >> Have you run fsck?
> >>
> >> Cheers
> >>
> >> On Fri, Aug 7, 2015 at 12:59 AM, Adrià Vilà <avila@datknosys.com>
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> HBase RegionServers fail once in a while:
> >>> - it can be any regionserver, not always the same
> >>> - it can happen when the whole cluster is idle (at least not executing
> >>> any human-launched task)
> >>> - it can happen at any time, not always the same
> >>>
> >>> The cluster versions:
> >>> - Phoenix 4.4 (or 4.5)
> >>> - HBase 1.1.1
> >>> - Hadoop/HDFS 2.7.1
> >>> - Zookeeper 3.4.6
> >>>
> >>> Some configs:
> >>> - ulimit -a
> >>> core file size (blocks, -c) 0
> >>> data seg size (kbytes, -d) unlimited
> >>> scheduling priority (-e) 0
> >>> file size (blocks, -f) unlimited
> >>> pending signals (-i) 103227
> >>> max locked memory (kbytes, -l) 64
> >>> max memory size (kbytes, -m) unlimited
> >>> open files (-n) 1024
> >>> pipe size (512 bytes, -p) 8
> >>> POSIX message queues (bytes, -q) 819200
> >>> real-time priority (-r) 0
> >>> stack size (kbytes, -s) 10240
> >>> cpu time (seconds, -t) unlimited
> >>> max user processes (-u) 103227
> >>> virtual memory (kbytes, -v) unlimited
> >>> file locks (-x) unlimited
> >>> - have increased the default timeouts for: hbase rpc, zookeeper session,
> >>> dks socket, regionserver lease and client scanner.
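> >>>
> >>> (Since the limit seen by a running daemon can differ from an interactive
> >>> shell, the open-files limit actually in effect for the RegionServer can be
> >>> checked directly; a small sketch, assuming the process is found via its
> >>> HRegionServer main class:)
> >>>
> >>> cat /proc/$(pgrep -f HRegionServer | head -1)/limits | grep -i 'open files'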
> >>>
> >>> Below you can find the logs for the master, the regionserver that failed
> >>> first, another one that failed, and the datanode logs for the master and a
> >>> worker.
> >>>
> >>>
> >>> The timing was approximately:
> >>> 14:05 start hbase
> >>> 14:11 w-0 down
> >>> 14:14 w-1 down
> >>> 14:15 stop hbase
> >>>
> >>>
> >>> -------------
> >>> hbase master log (m)
> >>> -------------
> >>> 2015-08-06 14:11:13,640 ERROR
> >>> [PriorityRpcServer.handler=19,queue=1,port=16000]
> master.MasterRpcServices:
> >>> Region server hdp-w-0.c.dks-hadoop.internal,16020,1438869946905
> reported a
> >>> fatal error:
> >>> ABORTING region server
> >>> hdp-w-0.c.dks-hadoop.internal,16020,1438869946905: Unrecoverable
> exception
> >>> while closing region
> >>>
> SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891.,
> >>> still finishing close
> >>> Cause:
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>>
> >>> --------------
> >>> hbase regionserver log (w-0)
> >>> --------------
> >>> 2015-08-06 14:11:13,611 INFO
> >>> [PriorityRpcServer.handler=0,queue=0,port=16020]
> >>> regionserver.RSRpcServices: Close 888f017eb1c0557fbe7079b50626c891,
> moving
> >>> to hdp-m.c.dks-hadoop.internal,16020,1438869954062
> >>> 2015-08-06 14:11:13,615 INFO
> >>>
> [StoreCloserThread-SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891.-1]
> >>> regionserver.HStore: Closed 0
> >>> 2015-08-06 14:11:13,616 FATAL
> >>> [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020
> .append-pool1-t1]
> >>> wal.FSHLog: Could not append. Requesting close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,617 ERROR [sync.4] wal.FSHLog: Error syncing,
> >>> request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,617 FATAL [RS_CLOSE_REGION-hdp-w-0:16020-0]
> >>> regionserver.HRegionServer: ABORTING region server
> >>> hdp-w-0.c.dks-hadoop.internal,16020,1438869946905: Unrecoverable
> exception
> >>> while closing region
> >>>
> SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891.,
> >>> still finishing close
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,617 FATAL [RS_CLOSE_REGION-hdp-w-0:16020-0]
> >>> regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> are:
> >>> [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> >>> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> >>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> >>> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> >>> org.apache.phoenix.coprocessor.ScanRegionObserver,
> >>> org.apache.phoenix.hbase.index.Indexer,
> >>> org.apache.phoenix.coprocessor.SequenceRegionObserver,
> >>> org.apache.phoenix.coprocessor.MetaDataEndpointImpl]
> >>> 2015-08-06 14:11:13,627 INFO [RS_CLOSE_REGION-hdp-w-0:16020-0]
> >>> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >>> "beans" : [ {
> >>> "name" : "java.lang:type=Memory",
> >>> "modelerType" : "sun.management.MemoryImpl",
> >>> "Verbose" : true,
> >>> "HeapMemoryUsage" : {
> >>> "committed" : 2104754176,
> >>> "init" : 2147483648,
> >>> "max" : 2104754176,
> >>> "used" : 262288688
> >>> },
> >>> "ObjectPendingFinalizationCount" : 0,
> >>> "NonHeapMemoryUsage" : {
> >>> "committed" : 137035776,
> >>> "init" : 136773632,
> >>> "max" : 184549376,
> >>> "used" : 49168288
> >>> },
> >>> "ObjectName" : "java.lang:type=Memory"
> >>> } ],
> >>> "beans" : [ {
> >>> "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> >>> "modelerType" : "RegionServer,sub=IPC",
> >>> "tag.Context" : "regionserver",
> >>> "tag.Hostname" : "hdp-w-0"
> >>> } ],
> >>> "beans" : [ {
> >>> "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> >>> "modelerType" : "RegionServer,sub=Replication",
> >>> "tag.Context" : "regionserver",
> >>> "tag.Hostname" : "hdp-w-0"
> >>> } ],
> >>> "beans" : [ {
> >>> "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> >>> "modelerType" : "RegionServer,sub=Server",
> >>> "tag.Context" : "regionserver",
> >>> "tag.Hostname" : "hdp-w-0"
> >>> } ]
> >>> }
> >>> 2015-08-06 14:11:13,640 ERROR [sync.0] wal.FSHLog: Error syncing,
> >>> request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,640 WARN
> >>> [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020
> .logRoller]
> >>> wal.FSHLog: Failed last sync but no outstanding unsync edits so falling
> >>> through to close; java.io.IOException: All datanodes
> >>> DatanodeInfoWithStorage[10.240.187.182:50010
> ,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK]
> >>> are bad. Aborting...
> >>> 2015-08-06 14:11:13,641 ERROR
> >>> [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020
> .logRoller]
> >>> wal.ProtobufLogWriter: Got IOException while writing trailer
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:11:13,641 WARN
> >>> [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020
> .logRoller]
> >>> wal.FSHLog: Riding over failed WAL close of
> >>>
> hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576,
> >>> cause="All datanodes DatanodeInfoWithStorage[10.240.187.182:50010
> ,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK]
> >>> are bad. Aborting...", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS
> >>> SYNCED SO SHOULD BE OK
> >>> 2015-08-06 14:11:13,642 INFO
> >>> [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020
> .logRoller]
> >>> wal.FSHLog: Rolled WAL
> >>>
> /apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576
> >>> with entries=101, filesize=30.38 KB; new WAL
> >>>
> /apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438870273617
> >>> 2015-08-06 14:11:13,643 INFO [RS_CLOSE_REGION-hdp-w-0:16020-0]
> >>> regionserver.HRegionServer: STOPPED: Unrecoverable exception while
> closing
> >>> region
> >>>
> SYSTEM.SEQUENCE,]\x00\x00\x00,1438013446516.888f017eb1c0557fbe7079b50626c891.,
> >>> still finishing close
> >>> 2015-08-06 14:11:13,643 INFO
> >>> [regionserver/hdp-w-0.c.dks-hadoop.internal/10.240.164.0:16020
> .logRoller]
> >>> wal.FSHLog: Archiving
> >>>
> hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576
> >>> to
> >>>
> hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/oldWALs/hdp-w-0.c.dks-hadoop.internal%2C16020%2C1438869946905.default.1438869949576
> >>> 2015-08-06 14:11:13,643 ERROR [RS_CLOSE_REGION-hdp-w-0:16020-0]
> >>> executor.EventHandler: Caught throwable while processing event
> >>> M_RS_CLOSE_REGION
> >>> java.lang.RuntimeException: java.io.IOException: All datanodes
> >>> DatanodeInfoWithStorage[10.240.187.182:50010
> ,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK]
> >>> are bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:152)
> >>> at
> >>>
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >>> at
> >>>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> at
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>>
> >>> ------------
> >>> hbase regionserver log (w-1)
> >>> ------------
> >>> 2015-08-06 14:11:14,267 INFO [main-EventThread]
> >>> replication.ReplicationTrackerZKImpl:
> >>> /hbase-unsecure/rs/hdp-w-0.c.dks-hadoop.internal,16020,1438869946905
> znode
> >>> expired, triggering replicatorRemoved event
> >>> 2015-08-06 14:12:08,203 INFO [ReplicationExecutor-0]
> >>> replication.ReplicationQueuesZKImpl: Atomically moving
> >>> hdp-w-0.c.dks-hadoop.internal,16020,1438869946905's wals to my queue
> >>> 2015-08-06 14:12:56,252 INFO
> >>> [PriorityRpcServer.handler=5,queue=1,port=16020]
> >>> regionserver.RSRpcServices: Close 918ed7c6568e7500fb434f4268c5bbc5,
> moving
> >>> to hdp-m.c.dks-hadoop.internal,16020,1438869954062
> >>> 2015-08-06 14:12:56,260 INFO
> >>>
> [StoreCloserThread-SYSTEM.SEQUENCE,\x7F\x00\x00\x00,1438013446516.918ed7c6568e7500fb434f4268c5bbc5.-1]
> >>> regionserver.HStore: Closed 0
> >>> 2015-08-06 14:12:56,261 FATAL
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020
> .append-pool1-t1]
> >>> wal.FSHLog: Could not append. Requesting close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,261 ERROR [sync.3] wal.FSHLog: Error syncing,
> >>> request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,262 FATAL [RS_CLOSE_REGION-hdp-w-1:16020-0]
> >>> regionserver.HRegionServer: ABORTING region server
> >>> hdp-w-1.c.dks-hadoop.internal,16020,1438869946909: Unrecoverable
> exception
> >>> while closing region
> >>>
> SYSTEM.SEQUENCE,\x7F\x00\x00\x00,1438013446516.918ed7c6568e7500fb434f4268c5bbc5.,
> >>> still finishing close
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,262 FATAL [RS_CLOSE_REGION-hdp-w-1:16020-0]
> >>> regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> are:
> >>> [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> >>> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> >>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> >>> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> >>> org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint,
> >>> org.apache.phoenix.coprocessor.ScanRegionObserver,
> >>> org.apache.phoenix.hbase.index.Indexer,
> >>> org.apache.phoenix.coprocessor.SequenceRegionObserver]
> >>> 2015-08-06 14:12:56,281 INFO [RS_CLOSE_REGION-hdp-w-1:16020-0]
> >>> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >>> "beans" : [ {
> >>> "name" : "java.lang:type=Memory",
> >>> "modelerType" : "sun.management.MemoryImpl",
> >>> "ObjectPendingFinalizationCount" : 0,
> >>> "NonHeapMemoryUsage" : {
> >>> "committed" : 137166848,
> >>> "init" : 136773632,
> >>> "max" : 184549376,
> >>> "used" : 48667528
> >>> },
> >>> "HeapMemoryUsage" : {
> >>> "committed" : 2104754176,
> >>> "init" : 2147483648,
> >>> "max" : 2104754176,
> >>> "used" : 270075472
> >>> },
> >>> "Verbose" : true,
> >>> "ObjectName" : "java.lang:type=Memory"
> >>> } ],
> >>> "beans" : [ {
> >>> "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> >>> "modelerType" : "RegionServer,sub=IPC",
> >>> "tag.Context" : "regionserver",
> >>> "tag.Hostname" : "hdp-w-1"
> >>> } ],
> >>> "beans" : [ {
> >>> "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> >>> "modelerType" : "RegionServer,sub=Replication",
> >>> "tag.Context" : "regionserver",
> >>> "tag.Hostname" : "hdp-w-1"
> >>> } ],
> >>> "beans" : [ {
> >>> "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> >>> "modelerType" : "RegionServer,sub=Server",
> >>> "tag.Context" : "regionserver",
> >>> "tag.Hostname" : "hdp-w-1"
> >>> } ]
> >>> }
> >>> 2015-08-06 14:12:56,284 ERROR [sync.4] wal.FSHLog: Error syncing,
> >>> request close of wal
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,285 WARN
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020
> .logRoller]
> >>> wal.FSHLog: Failed last sync but no outstanding unsync edits so falling
> >>> through to close; java.io.IOException: All datanodes
> >>> DatanodeInfoWithStorage[10.240.187.182:50010
> ,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK]
> >>> are bad. Aborting...
> >>> 2015-08-06 14:12:56,285 ERROR
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020
> .logRoller]
> >>> wal.ProtobufLogWriter: Got IOException while writing trailer
> >>> java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>> 2015-08-06 14:12:56,285 WARN
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020
> .logRoller]
> >>> wal.FSHLog: Riding over failed WAL close of
> >>>
> hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359,
> >>> cause="All datanodes DatanodeInfoWithStorage[10.240.187.182:50010
> ,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK]
> >>> are bad. Aborting...", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS
> >>> SYNCED SO SHOULD BE OK
> >>> 2015-08-06 14:12:56,287 INFO
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020
> .logRoller]
> >>> wal.FSHLog: Rolled WAL
> >>>
> /apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359
> >>> with entries=100, filesize=30.73 KB; new WAL
> >>>
> /apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438870376262
> >>> 2015-08-06 14:12:56,288 INFO
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020
> .logRoller]
> >>> wal.FSHLog: Archiving
> >>>
> hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/WALs/hdp-w-1.c.dks-hadoop.internal,16020,1438869946909/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359
> >>> to
> >>>
> hdfs://hdp-m.c.dks-hadoop.internal:8020/apps/hbase/data/oldWALs/hdp-w-1.c.dks-hadoop.internal%2C16020%2C1438869946909.default.1438869950359
> >>> 2015-08-06 14:12:56,315 INFO [RS_CLOSE_REGION-hdp-w-1:16020-0]
> >>> regionserver.HRegionServer: STOPPED: Unrecoverable exception while
> closing
> >>> region
> >>>
> SYSTEM.SEQUENCE,\x7F\x00\x00\x00,1438013446516.918ed7c6568e7500fb434f4268c5bbc5.,
> >>> still finishing close
> >>> 2015-08-06 14:12:56,315 INFO
> >>> [regionserver/hdp-w-1.c.dks-hadoop.internal/10.240.2.235:16020]
> >>> regionserver.SplitLogWorker: Sending interrupt to stop the worker
> thread
> >>> 2015-08-06 14:12:56,315 ERROR [RS_CLOSE_REGION-hdp-w-1:16020-0]
> >>> executor.EventHandler: Caught throwable while processing event
> >>> M_RS_CLOSE_REGION
> >>> java.lang.RuntimeException: java.io.IOException: All datanodes
> >>> DatanodeInfoWithStorage[10.240.187.182:50010
> ,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK]
> >>> are bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:152)
> >>> at
> >>>
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >>> at
> >>>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> at
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[
> >>> 10.240.187.182:50010,DS-8c63ac70-2f98-4084-91ee-a847b4f48ce2,DISK] are
> >>> bad. Aborting...
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> >>> at
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> >>>
> >>> -------------
> >>> m datanode log
> >>> -------------
> >>> 2015-07-27 14:11:16,082 INFO datanode.DataNode
> >>> (BlockReceiver.java:run(1348)) - PacketResponder:
> >>> BP-369072949-10.240.200.196-1437998325049:blk_1073742677_1857,
> >>> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> >>> 2015-07-27 14:11:16,132 INFO datanode.DataNode
> >>> (DataXceiver.java:writeBlock(655)) - Receiving
> >>> BP-369072949-10.240.200.196-1437998325049:blk_1073742678_1858 src: /
> >>> 10.240.200.196:56767 dest: /10.240.200.196:50010
> >>> 2015-07-27 14:11:16,155 INFO DataNode.clienttrace
> >>> (BlockReceiver.java:finalizeBlock(1375)) - src: /10.240.200.196:56767,
> >>> dest: /10.240.200.196:50010, bytes: 117761, op: HDFS_WRITE, cliID:
> >>> DFSClient_NONMAPREDUCE_177514816_1, offset: 0, srvID:
> >>> 329bbe62-bcea-4a6d-8c97-e800631deb81, blockid:
> >>> BP-369072949-10.240.200.196-1437998325049:blk_1073742678_1858,
> duration:
> >>> 6385289
> >>> 2015-07-27 14:11:16,155 INFO datanode.DataNode
> >>> (BlockReceiver.java:run(1348)) - PacketResponder:
> >>> BP-369072949-10.240.200.196-1437998325049:blk_1073742678_1858,
> >>> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> >>> 2015-07-27 14:11:16,267 ERROR datanode.DataNode
> >>> (DataXceiver.java:run(278)) -
> hdp-m.c.dks-hadoop.internal:50010:DataXceiver
> >>> error processing unknown operation src: /127.0.0.1:60513 dst: /
> >>> 127.0.0.1:50010
> >>> java.io.EOFException
> >>> at java.io.DataInputStream.readShort(DataInputStream.java:315)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-07-27 14:11:16,405 INFO datanode.DataNode
> >>> (DataNode.java:transferBlock(1943)) - DatanodeRegistration(
> >>> 10.240.200.196:50010,
> datanodeUuid=329bbe62-bcea-4a6d-8c97-e800631deb81,
> >>> infoPort=50075, infoSecurePort=0, ipcPort=8010,
> >>>
> storageInfo=lv=-56;cid=CID-1247f294-77a9-4605-b6d3-4c1398bb5db0;nsid=2032226938;c=0)
> >>> Starting thread to transfer
> >>> BP-369072949-10.240.200.196-1437998325049:blk_1073742649_1829 to
> >>> 10.240.2.235:50010 10.240.164.0:50010
> >>>
> >>> -------------
> >>> w-0 datanode log
> >>> -------------
> >>> 2015-07-27 14:11:25,019 ERROR datanode.DataNode
> >>> (DataXceiver.java:run(278)) -
> >>> hdp-w-0.c.dks-hadoop.internal:50010:DataXceiver error processing
> unknown
> >>> operation src: /127.0.0.1:47993 dst: /127.0.0.1:50010
> >>> java.io.EOFException
> >>> at java.io.DataInputStream.readShort(DataInputStream.java:315)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-07-27 14:11:25,077 INFO DataNode.clienttrace
> >>> (DataXceiver.java:requestShortCircuitFds(369)) - src: 127.0.0.1, dest:
> >>> 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1073742631, srvID:
> >>> a5eea5a8-5112-46da-9f18-64274486c472, success: true
> >>>
> >>>
> >>> -----------------------------
> >>> Thank you in advance,
> >>>
> >>> Adrià
> >>>
> >>>
> >>
> >>
>


