hbase-user mailing list archives

From: Stack <st...@duboce.net>
Subject: Re: High Priority RPC handler usage with meta replication - block read errors?
Date: Fri, 28 Jun 2019 05:23:10 GMT
If you're on an early 2.0 or 2.1 version, perhaps it is HBASE-21292?
S

On Thu, Jun 27, 2019 at 10:13 PM Stack <stack@duboce.net> wrote:

> On Wed, Jun 26, 2019 at 6:47 AM James Kebinger <jkebinger@hubspot.com>
> wrote:
>
>> Hello,
>> We're experimenting with meta table replication
>
>
> Replicating the meta table itself? What's your thinking?
>
>
>
>> but have found that the
>> region servers hosting the replicas get extremely high priority-RPC-handler
>> usage, sometimes all the way to 100%, at which point clients start to
>> experience errors. The priority handler usage is much higher than the
>> region servers hosting non-replicated meta ever experienced.
>>
>
> The meta table will be hot on a busy cluster. Adding replication load will
> up the temperature even more.
>
> Check out James Moore's suggestions. An HDFS issue could occupy a handler
> while it holds the id lock on a popular block.
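>
> For context on that id lock: readBlock takes a lock entry keyed on the
> block's offset, so only the first caller goes out to HDFS for a given
> block and every later caller queues behind it. A minimal sketch of the
> pattern, NOT the actual org.apache.hadoop.hbase.util.IdLock code
> (names and shapes simplified):
>
>   import java.util.concurrent.ConcurrentHashMap;
>
>   /** Simplified per-id lock; illustration only. */
>   class SimpleIdLock {
>     static final class Entry {
>       boolean released;   // set by the owner on release
>     }
>
>     private final ConcurrentHashMap<Long, Entry> entries =
>         new ConcurrentHashMap<>();
>
>     /** Blocks until the caller exclusively owns the lock for this id. */
>     Entry getLockEntry(long id) throws InterruptedException {
>       while (true) {
>         Entry fresh = new Entry();
>         if (entries.putIfAbsent(id, fresh) == null) {
>           return fresh;                 // uncontended: we own the id
>         }
>         Entry existing = entries.get(id);
>         if (existing == null) {
>           continue;                     // owner just released; retry
>         }
>         synchronized (existing) {
>           while (entries.get(id) == existing && !existing.released) {
>             existing.wait();            // where your dumped handlers sit
>           }
>         }
>       }
>     }
>
>     /** The owner calls this when its block read finishes (or fails). */
>     void releaseLockEntry(long id, Entry entry) {
>       synchronized (entry) {
>         entries.remove(id);
>         entry.released = true;
>         entry.notifyAll();
>       }
>     }
>   }
>
> If the entry's owner is stuck in a slow HDFS read of a hot block, every
> other handler wanting that block parks in getLockEntry, which is exactly
> the IdLock.getLockEntry frames in your dumps below.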
>
> Can you figure out who is trying to read the removed file?
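>
> One way, assuming NameNode audit logging is enabled: grep the HDFS
> audit log for opens of that hfile; the ip= field on matching lines
> tells you who the reader is. E.g.:
>
>   grep 'cmd=open' hdfs-audit.log | grep 85c3e4b648eb437a8ef62ff9b39f525d
>
> (Audit log name and location vary by install.)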
>
> Can't tell for sure from the snippet below, but are handlers frequently
> going out to HDFS to pull in blocks rather than fetching from the block
> cache? How is the server-side cache doing on the server hosting meta? Can
> you get more of meta up into the block cache?
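>
> If the cache is short, one knob to look at is the fraction of
> RegionServer heap given to the block cache in hbase-site.xml. A hedged
> example only; the default is 0.4 and the right value depends on your
> heap size and read/write mix:
>
>   <property>
>     <name>hfile.block.cache.size</name>
>     <value>0.5</value>
>   </property>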
>
> If you disable replication, does everything settle down?
>
> S
>
>
>> Has anyone seen this behavior? Are we doing something wrong?
>> Our settings changes are
>>
>> hbase.regionserver.meta.storefile.refresh.period: 30000
>> hbase.meta.replica.count: 3
>> hbase.master.hfilecleaner.ttl: 600000
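>>
>> (All three are set in hbase-site.xml, i.e.:
>>
>>   <property>
>>     <name>hbase.regionserver.meta.storefile.refresh.period</name>
>>     <value>30000</value>
>>   </property>
>>   <property>
>>     <name>hbase.meta.replica.count</name>
>>     <value>3</value>
>>   </property>
>>   <property>
>>     <name>hbase.master.hfilecleaner.ttl</name>
>>     <value>600000</value>
>>   </property>
>>
>> followed by a restart to pick them up.)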
>>
>> We're on CDH 5.9.0 (HBase 1.2).
>>
>> We see occasional errors like the one below (as I write this, 30 minutes
>> after the log entry, that file no longer exists):
>> java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
>> self=/REDACTED:11702, remote=REDACTED:50010, for file
>> /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
>> for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
>>         at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
>>         at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
>>         at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
>>         at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
>>         at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
>>
>> Doing a stack dump of a region server process to see what the priority
>> handlers are up to reveals the following:
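>>
>> (The dumps below are jstack output with identical stacks grouped
>> together; the "N times" headers are the group counts. Roughly, assuming
>> a single RegionServer process on the host:
>>
>>   jstack $(pgrep -f HRegionServer) > rs-threads.txt
>>
>> followed by whatever grouping script you like.)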
>>
>> 1 times (0% total cpu)
>> "RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
>> prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
>> [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
>>    java.lang.Thread.State: WAITING (parking)
>> at sun.misc.Unsafe.park(Native Method)
>> - parking to wait for  <0x00000003d94b6f18> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>> at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>> at org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
>> at org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
>> --
>>
>> 3 times (0% total cpu)
>> "RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
>> prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
>> [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
>>    java.lang.Thread.State: WAITING (parking)
>> at sun.misc.Unsafe.park(Native Method)
>> - parking to wait for  <0x0000000723bbee00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
>> at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
>> at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
>> --
>>
>> 5 times (0% total cpu)
>> "RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
>> prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
>> [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
>>    java.lang.Thread.State: WAITING (on object monitor)
>> at java.lang.Object.wait(Native Method)
>> at java.lang.Object.wait(Object.java:502)
>> at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
>> - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
>> at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
>> --
>>
>> 7 times (0% total cpu)
>> "RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
>> prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
>> [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
>>    java.lang.Thread.State: WAITING (on object monitor)
>> at java.lang.Object.wait(Native Method)
>> at java.lang.Object.wait(Object.java:502)
>> at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
>> - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
>> at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
>>
>
