hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase file encryption, inconsistencies observed and data loss
Date Wed, 30 Jul 2014 19:50:55 GMT
In BaseDecoder#rethrowEofException():

    if (!isEof) throw ioEx;
    LOG.error("Partial cell read caused by EOF: " + ioEx);
    EOFException eofEx = new EOFException("Partial cell read");
    eofEx.initCause(ioEx);
    throw eofEx;

Throwing an EOFException does not propagate the "Partial cell read" condition
to HLogSplitter, since HLogSplitter does not treat EOFException as an error.

I think a new exception type (e.g. DecoderException) should be used above.
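A minimal sketch of the idea (DecoderException is only the name suggested
above, and the classify() helper and its behavior are illustrative
assumptions, not existing HBase code): a dedicated exception type would let
the splitter keep skipping ordinary EOFs while surfacing partial cell reads.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical DecoderException: a dedicated type lets callers distinguish
// a partial cell read from an ordinary EOF at a clean record boundary.
class DecoderException extends IOException {
  DecoderException(String message, Throwable cause) {
    super(message, cause);
  }
}

public class Sketch {
  // Hypothetical splitter-side handling: a plain EOFException is still the
  // expected truncation after a RS crash, while a DecoderException marks a
  // partial cell read that should fail the split instead of being skipped.
  static String classify(IOException e) {
    if (e instanceof DecoderException) {
      return "error";     // partial cell read: surface it
    }
    if (e instanceof EOFException) {
      return "expected";  // truncated file: skip, as getNextLogLine does today
    }
    return "unknown";
  }

  public static void main(String[] args) {
    IOException partial = new DecoderException("Partial cell read",
        new IOException("Premature EOF from inputStream"));
    System.out.println(classify(partial));            // error
    System.out.println(classify(new EOFException())); // expected
  }
}
```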

Cheers


On Wed, Jul 30, 2014 at 10:22 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Looking at HLogSplitter#getNextLogLine():
>
>     try {
>       return in.next();
>     } catch (EOFException eof) {
>       // truncated files are expected if a RS crashes (see HBASE-2643)
>       LOG.info("EOF from hlog " + path + ".  continuing");
>       return null;
> The EOFException is not treated as an error. But the posted log doesn't
> contain "EOF from hlog ", so there may be another code path leading to
> codec.BaseDecoder.
>
> Cheers
>
>
> On Wed, Jul 30, 2014 at 9:20 AM, Kiran Kumar.M.R <
> Kiran.Kumar.MR@huawei.com> wrote:
>
>> Hi,
>>
>> After step 4 (i.e. disabling WAL encryption, removing
>> SecureProtobufReader/Writer, and restarting), reading the encrypted WALs
>> fails, mainly due to an EOFException in BaseDecoder. This is not treated
>> as an error, and these WALs are moved to /oldWALs.
>>
>> Following is observed in log files:
>> 2014-07-30 19:44:29,254 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> wal.HLogSplitter: Splitting hlog:
>> hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017,
>> length=172
>> 2014-07-30 19:44:29,254 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> wal.HLogSplitter: DistributedLogReplay = false
>> 2014-07-30 19:44:29,313 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> util.FSHDFSUtils: Recovering lease on dfs file
>> hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017
>> 2014-07-30 19:44:29,315 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> util.FSHDFSUtils: recoverLease=true, attempt=0 on
>> file=hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017
>> after 1ms
>> 2014-07-30 19:44:29,429 DEBUG
>> [RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-0] wal.HLogSplitter: Writer
>> thread Thread[RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-0,5,main]: starting
>> 2014-07-30 19:44:29,429 DEBUG
>> [RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-1] wal.HLogSplitter: Writer
>> thread Thread[RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-1,5,main]: starting
>> 2014-07-30 19:44:29,430 DEBUG
>> [RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-2] wal.HLogSplitter: Writer
>> thread Thread[RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-2,5,main]: starting
>> 2014-07-30 19:44:29,591 ERROR [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> codec.BaseDecoder: Partial cell read caused by EOF: java.io.IOException:
>> Premature EOF from inputStream
>> 2014-07-30 19:44:29,592 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> wal.HLogSplitter: Finishing writing output logs and closing down.
>> 2014-07-30 19:44:29,592 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> wal.HLogSplitter: Waiting for split writer threads to finish
>> 2014-07-30 19:44:29,592 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> wal.HLogSplitter: Split writers finished
>> 2014-07-30 19:44:29,592 INFO  [RS_LOG_REPLAY_OPS-HOST-16:15264-1]
>> wal.HLogSplitter: Processed 0 edits across 0 regions; log
>> file=hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017
>> is corrupted = false progress failed = false
>>
>> To fix this, we need to propagate the EOF condition to HLogSplitter as an
>> error. Any suggestions on the fix?
>>
>>
>> Regards,
>> Kiran
>>
>> __________________________________________________________________________________________________________
>> This e-mail and its attachments contain confidential information from
>> HUAWEI, which is intended only for the person or entity whose address is
>> listed above. Any use of the information contained herein in any way
>> (including, but not limited to, total or partial disclosure, reproduction,
>> or dissemination) by persons other than the intended recipient(s) is
>> prohibited. If you receive this e-mail in error, please notify the sender
>> by phone or email immediately and delete it!
>>
>> __________________________________________________________________________________________________________
>>
>>
>>
>>
>> > -----Original Message-----
>> > From: Shankar hiremath [mailto:shankar.hiremath@huawei.com]
>> > Sent: Friday, July 25, 2014 19:38
>> > To: user@hbase.apache.org
>> > Subject: HBase file encryption, inconsistencies observed and data loss
>> >
>> > With HBase file encryption, some inconsistencies were observed, and
>> > data loss happens after running the hbck tool; the operation steps are
>> > below. (One thing I observed: on HMaster startup, if a WAL file cannot
>> > be processed, it is still moved to /oldWALs.)
>> >
>> > Procedure:
>> > 1. Start the HBase services (HMaster & Region Server).
>> > 2. Enable HFile encryption and WAL file encryption as below, and
>> > perform 'table4-0' put operations (100 records added):
>> > <property>
>> >   <name>hbase.crypto.keyprovider</name>
>> >   <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
>> > </property>
>> > <property>
>> >   <name>hbase.crypto.keyprovider.parameters</name>
>> >   <value>jceks:///opt/shankar1/kdc_keytab/hbase.jks?password=Hadoop@234</value>
>> > </property>
>> > <property>
>> >   <name>hbase.crypto.master.key.name</name>
>> >   <value>hdfs</value>
>> > </property>
>> > <property>
>> >   <name>hfile.format.version</name>
>> >   <value>3</value>
>> > </property>
>> > <property>
>> >   <name>hbase.regionserver.hlog.reader.impl</name>
>> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
>> > </property>
>> > <property>
>> >   <name>hbase.regionserver.hlog.writer.impl</name>
>> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
>> > </property>
>> > <property>
>> >   <name>hbase.regionserver.wal.encryption</name>
>> >   <value>true</value>
>> > </property>
>> > 3. The machine went down, so all processes went down.
>> > 4. We disabled WAL file encryption for performance reasons, keeping
>> > encryption only for HFile, as below:
>> > <property>
>> >   <name>hbase.crypto.keyprovider</name>
>> >   <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
>> > </property>
>> > <property>
>> >   <name>hbase.crypto.keyprovider.parameters</name>
>> >   <value>jceks:///opt/shankar1/kdc_keytab/hbase.jks?password=Hadoop@234</value>
>> > </property>
>> > <property>
>> >   <name>hbase.crypto.master.key.name</name>
>> >   <value>hdfs</value>
>> > </property>
>> > <property>
>> >   <name>hfile.format.version</name>
>> >   <value>3</value>
>> > </property>
>> > 5. Start the Region Server and query the 'table4-0' data
>> > hbase(main):003:0> count 'table4-0'
>> > ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region
>> > table4-0,,1406207815456.fc10620a3dcc14e004ab034420f7d332. is not online
>> > on XX-XX-XX-XX,60020,1406209023146
>> > at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2685)
>> > at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4119)
>> > at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3066)
>> > at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>> > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2084)
>> > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:168)
>> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:39)
>> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:111)
>> > at java.lang.Thread.run(Thread.java:662)
>> > 6. Not able to read the data, so we decided to revert the
>> > configuration to the original.
>> > 7. Kill/stop the Region Server and revert all the configurations to
>> > the original, as below:
>> > <property>
>> >   <name>hbase.crypto.keyprovider</name>
>> >   <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
>> > </property>
>> > <property>
>> >   <name>hbase.crypto.keyprovider.parameters</name>
>> >   <value>jceks:///opt/shankar1/kdc_keytab/hbase.jks?password=Hadoop@234</value>
>> > </property>
>> > <property>
>> >   <name>hbase.crypto.master.key.name</name>
>> >   <value>hdfs</value>
>> > </property>
>> > <property>
>> >   <name>hfile.format.version</name>
>> >   <value>3</value>
>> > </property>
>> > <property>
>> >   <name>hbase.regionserver.hlog.reader.impl</name>
>> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
>> > </property>
>> > <property>
>> >   <name>hbase.regionserver.hlog.writer.impl</name>
>> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
>> > </property>
>> > <property>
>> >   <name>hbase.regionserver.wal.encryption</name>
>> >   <value>true</value>
>> > </property>
>> > 7. Start the Region Server, and perform the 'table4-0' query:
>> > hbase(main):003:0> count 'table4-0'
>> > ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region
>> > table4-0,,1406207815456.fc10620a3dcc14e004ab034420f7d332. is not online
>> > on XX-XX-XX-XX,60020,1406209023146
>> > at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2685)
>> > at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4119)
>> > at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3066)
>> > at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>> > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2084)
>> > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:168)
>> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:39)
>> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:111)
>> > at java.lang.Thread.run(Thread.java:662)
>> > 8. Run hbase hbck to repair, as below:
>> > ./hbase hbck -details
>> > .........................
>> > Summary:
>> > table1-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table2-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table3-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table4-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table5-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table6-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table7-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table8-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > table9-0 is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > hbase:meta is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:acl is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > hbase:namespace is okay.
>> > Number of regions: 0
>> > Deployed on:
>> > 22 inconsistencies detected.
>> > Status: INCONSISTENT
>> > 2014-07-24 19:13:05,532 INFO [main]
>> > client.HConnectionManager$HConnectionImplementation: Closing master
>> > protocol: MasterService
>> > 2014-07-24 19:13:05,533 INFO [main]
>> > client.HConnectionManager$HConnectionImplementation: Closing zookeeper
>> > sessionid=0x1475d1611611bcf
>> > 2014-07-24 19:13:05,533 DEBUG [main] zookeeper.ZooKeeper: Closing
>> > session: 0x1475d1611611bcf
>> > 2014-07-24 19:13:05,533 DEBUG [main] zookeeper.ClientCnxn: Closing
>> > client for session: 0x1475d1611611bcf
>> > 2014-07-24 19:13:05,546 DEBUG [main-SendThread(XX-XX-XX-XX:2181)]
>> > zookeeper.ClientCnxn: Reading reply sessionid:0x1475d1611611bcf,
>> > packet:: clientPath:null serverPath:null finished:false header:: 6,-11
>> > replyHeader:: 6,4295102074,0 request:: null response:: null
>> > 2014-07-24 19:13:05,546 DEBUG [main] zookeeper.ClientCnxn:
>> > Disconnecting client for session: 0x1475d1611611bcf
>> > 2014-07-24 19:13:05,546 DEBUG [main-SendThread(XX-XX-XX-XX:2181)]
>> > zookeeper.ClientCnxn: An exception was thrown while closing send thread
>> > for session 0x1475d1611611bcf : Unable to read additional data from
>> > server sessionid 0x1475d1611611bcf, likely server has closed socket
>> > 2014-07-24 19:13:05,546 INFO [main-EventThread] zookeeper.ClientCnxn:
>> > EventThread shut down
>> > 2014-07-24 19:13:05,546 INFO [main] zookeeper.ZooKeeper: Session:
>> > 0x1475d1611611bcf closed
>> > shankar1@XX-XX-XX-XX:~/DataSight/hbase/bin>
>> > 9. Fix the assignments as below
>> > ./hbase hbck -fixAssignments
>> > Summary:
>> > table1-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table2-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table3-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table4-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table5-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table6-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table7-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table8-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table9-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:meta is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:acl is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:namespace is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > 0 inconsistencies detected.
>> > Status: OK
>> > 2014-07-24 19:44:55,194 INFO [main]
>> > client.HConnectionManager$HConnectionImplementation: Closing master
>> > protocol: MasterService
>> > 2014-07-24 19:44:55,194 INFO [main]
>> > client.HConnectionManager$HConnectionImplementation: Closing zookeeper
>> > sessionid=0x2475d15f7b31b73
>> > 2014-07-24 19:44:55,194 DEBUG [main] zookeeper.ZooKeeper: Closing
>> > session: 0x2475d15f7b31b73
>> > 2014-07-24 19:44:55,194 DEBUG [main] zookeeper.ClientCnxn: Closing
>> > client for session: 0x2475d15f7b31b73
>> > 2014-07-24 19:44:55,203 DEBUG [main-SendThread(XX-XX-XX-XX:2181)]
>> > zookeeper.ClientCnxn: Reading reply sessionid:0x2475d15f7b31b73,
>> > packet:: clientPath:null serverPath:null finished:false header:: 7,-11
>> > replyHeader:: 7,4295102377,0 request:: null response:: null
>> > 2014-07-24 19:44:55,203 DEBUG [main] zookeeper.ClientCnxn:
>> > Disconnecting client for session: 0x2475d15f7b31b73
>> > 2014-07-24 19:44:55,204 DEBUG [main-SendThread(XX-XX-XX-XX:2181)]
>> > zookeeper.ClientCnxn: An exception was thrown while closing send thread
>> > for session 0x2475d15f7b31b73 : Unable to read additional data from
>> > server sessionid 0x2475d15f7b31b73, likely server has closed socket
>> > 2014-07-24 19:44:55,204 INFO [main] zookeeper.ZooKeeper: Session:
>> > 0x2475d15f7b31b73 closed
>> > 2014-07-24 19:44:55,204 INFO [main-EventThread] zookeeper.ClientCnxn:
>> > EventThread shut down
>> > 10. Fix the assignments as below:
>> > ./hbase hbck -fixAssignments -fixMeta
>> > Summary:
>> > table1-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table2-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table3-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table4-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table5-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table6-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table7-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table8-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > table9-0 is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:meta is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:acl is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > hbase:namespace is okay.
>> > Number of regions: 1
>> > Deployed on: XX-XX-XX-XX,60020,1406209023146
>> > 0 inconsistencies detected.
>> > Status: OK
>> > 2014-07-24 19:46:16,290 INFO [main]
>> > client.HConnectionManager$HConnectionImplementation: Closing master
>> > protocol: MasterService
>> > 2014-07-24 19:46:16,290 INFO [main]
>> > client.HConnectionManager$HConnectionImplementation: Closing zookeeper
>> > sessionid=0x3475d1605321be9
>> > 2014-07-24 19:46:16,290 DEBUG [main] zookeeper.ZooKeeper: Closing
>> > session: 0x3475d1605321be9
>> > 2014-07-24 19:46:16,290 DEBUG [main] zookeeper.ClientCnxn: Closing
>> > client for session: 0x3475d1605321be9
>> > 2014-07-24 19:46:16,300 DEBUG [main-SendThread(XX-XX-XX-XX:2181)]
>> > zookeeper.ClientCnxn: Reading reply sessionid:0x3475d1605321be9,
>> > packet:: clientPath:null serverPath:null finished:false header:: 6,-11
>> > replyHeader:: 6,4295102397,0 request:: null response:: null
>> > 2014-07-24 19:46:16,300 DEBUG [main] zookeeper.ClientCnxn:
>> > Disconnecting client for session: 0x3475d1605321be9
>> > 2014-07-24 19:46:16,300 DEBUG [main-SendThread(XX-XX-XX-XX:2181)]
>> > zookeeper.ClientCnxn: An exception was thrown while closing send thread
>> > for session 0x3475d1605321be9 : Unable to read additional data from
>> > server sessionid 0x3475d1605321be9, likely server has closed socket
>> > 2014-07-24 19:46:16,300 INFO [main] zookeeper.ZooKeeper: Session:
>> > 0x3475d1605321be9 closed
>> > 2014-07-24 19:46:16,300 INFO [main-EventThread] zookeeper.ClientCnxn:
>> > EventThread shut down
>> > hbase(main):006:0> count 'table4-0'
>> > 0 row(s) in 0.0200 seconds
>> > => 0
>> > hbase(main):007:0>
>> > Complete data loss happened;
>> > WALs, oldWALs, and /hbase/data/default/table4-0/ do not have any data.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>
