hbase-user mailing list archives

From "LeBlanc, Jacob" <jacob.lebl...@microfocus.com>
Subject Any way to avoid HBASE-21069?
Date Mon, 26 Nov 2018 19:49:25 GMT
Hi,

We've recently upgraded our production clusters to 1.4.6. We have jobs that run periodically to
take snapshots of some of our HBase tables, and these jobs seem to be running into https://issues.apache.org/jira/browse/HBASE-21069.
I understand there was a missing null check, but the JIRA doesn't really explain how the null
occurs in the first place. For those of us running 1.4.6, is there anything we can do to avoid
hitting the bug?
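
To illustrate what I mean by "take snapshots": a plain snapshot call through the 1.x Admin API
looks roughly like the sketch below. This is a simplified example, not our actual job code, and
the table and snapshot names are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotJobSketch {
    public static void main(String[] args) throws Exception {
        // Client configuration is picked up from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Timestamped snapshot name is illustrative; any valid name works.
            String snapshotName = "upload_metadata_v2-" + System.currentTimeMillis();
            admin.snapshot(snapshotName, TableName.valueOf("upload_metadata_v2"));
        }
    }
}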

This problem is made worse because we are running a cluster in AWS EMR, meaning our WAL is
on a different filesystem (HDFS) than the HBase root directory (EMRFS), and we are hitting
some sort of issue where the master sometimes gets stuck while splitting a WAL from a crashed
region server:

2018-11-20 12:01:58,599 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't rename s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359708-ip-172-20-113-197.us-west-2.compute.internal%2C16020%2C1542620776146.1542673338055.temp
to s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720
java.io.IOException: Cannot get log reader
                at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:365)
                at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
                at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
                at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.deleteOneWithFewerEntries(WALSplitter.java:1363)
                at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1496)
                at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1448)
                at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1445)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Wrong FS: s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720,
expected: hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020
                at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:669)
                at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
                at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
                at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
                at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
                at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:337)
                at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)
                at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
                ... 12 more
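
In case it helps diagnosis, here is a minimal sketch of my reading of the "Wrong FS" part of
that trace (this is my own illustration, not the actual WALSplitter code; the bucket and path
below are made up): a FileSystem handle rooted in the default filesystem (HDFS on our cluster)
is being asked to open a recovered.edits path whose scheme is s3://, and checkPath() rejects it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path recoveredEdit = new Path(
            "s3://example-bucket/hbase/data/default/t1/region/recovered.edits/0000000000000000001");

        // The default filesystem from the configuration (hdfs:// on the cluster,
        // file:// on a laptop).
        FileSystem defaultFs = FileSystem.get(conf);
        System.out.println("default FS scheme: " + defaultFs.getScheme());

        // The filesystem resolved from the path itself; resolving an s3:// path
        // requires an S3 filesystem implementation (EMRFS on EMR) on the classpath.
        FileSystem pathFs = recoveredEdit.getFileSystem(conf);
        System.out.println("path FS scheme:    " + pathFs.getScheme());

        // defaultFs.open(recoveredEdit) throws
        //   IllegalArgumentException: Wrong FS ... expected: hdfs://...
        // while pathFs.open(recoveredEdit) would go to the right filesystem.
    }
}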

It seems like https://issues.apache.org/jira/browse/HBASE-20723 did not cover all use cases.
My understanding is that in 1.4.8 the recovered edits are colocated with the WAL
(https://issues.apache.org/jira/browse/HBASE-20734), so this will no longer be an issue, but AWS
has yet to release an EMR version with HBase 1.4.8, so it is causing us pain right now when we
hit this situation (it doesn't seem to happen every time a region server crashes - only twice so far).
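
For anyone not familiar with the EMR layout, here is a short sketch of why the split ends up
writing to S3 at all, as I understand it (paths are illustrative, and this is my understanding
of the directory layout, not HBase internals): the recovered.edits directory hangs off the
region directory under hbase.rootdir (EMRFS/S3), while the WALs being split live under
hbase.wal.dir (HDFS).

import org.apache.hadoop.fs.Path;

public class RecoveredEditsLayoutSketch {
    public static void main(String[] args) {
        Path rootDir = new Path("s3://example-bucket/hbase");            // hbase.rootdir (EMRFS)
        Path walDir  = new Path("hdfs://namenode:8020/user/hbase/WALs"); // hbase.wal.dir (HDFS)

        // Pre-HBASE-20734: recovered edits are written under the region directory,
        // i.e. under the root dir, so they land on S3.
        Path regionDir = new Path(rootDir, "data/default/upload_metadata_v2/exampleEncodedRegionName");
        Path recoveredEdits = new Path(regionDir, "recovered.edits");
        System.out.println("recovered edits written under: " + recoveredEdits);

        // The WAL files being split are read from the WAL filesystem.
        System.out.println("WALs read from:                " + walDir);
    }
}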

Unfortunately, because we are running an AWS EMR cluster, we can't really just patch the
region servers ourselves. We do have the option of upgrading to 1.4.7 to get the fix for
HBASE-21069, but that will take us some time to test, release, and schedule downtime for our
application, so any mitigating steps we could take in the meantime would be appreciated.

Thanks,

--Jacob LeBlanc


