hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: HBase strangeness and double deletes of HDFS blocks and writing to closed blocks
Date Fri, 01 Apr 2011 18:19:01 GMT
Thanks for taking the time to upload all those logs, I really appreciate it.

So from the looks of it, only 1 region wasn't able to split during the
time covered by those logs, and that split successfully rolled back. At
first I thought the data could have been deleted in the parent region,
but we don't do that in the region server (it's the master that's
responsible for that deletion), meaning you couldn't have lost data that
way.
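
Side note on why that split failed: the "Took too long to split the
files and create the references" message in your regionserver log is the
file-splitting step hitting its timeout. If splits keep failing that way
you can give them more room on the region servers; if memory serves the
knob is the hbase-site.xml property below, but double-check the exact
name for your version:

  <property>
    <name>hbase.regionserver.fileSplitTimeout</name>
    <!-- ms allowed to create the daughter reference files; default is
         30000, raising it gives slow HDFS more breathing room -->
    <value>60000</value>
  </property>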

All of which makes me wonder about those missing rows: are they all
part of that region, or are they also in other regions? If they show up
missing in other regions too, then maybe this failed split is just a
red herring.
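
If you want a quick way to check which region a missing row would land
in, here's a sketch with the Java client (table name taken from your
logs, row key and class name made up):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WhichRegion {
    public static void main(String[] args) throws Exception {
      // Print the region that covers one of the row keys you know is missing
      HTable table = new HTable(HBaseConfiguration.create(), "sequence");
      System.out.println(table.getRegionLocation(
          Bytes.toBytes("one-of-the-missing-row-keys"))
          .getRegionInfo().getRegionNameAsString());
      table.close();
    }
  }

If they all land in d13ef276819124e550bb6e0be9c5cdc8 then that split is
worth a closer look.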

You say that you insert into two different families at different row
keys. IIUC that means you would insert row A in family f1 and row B in
family f2, and so on. And you say only one of the rows is there... I
guess you don't really mean that you were inserting into just 2 rows
for 11 hours and one of them was missing, right? More like, all the
data in one family was missing across that billion-plus-row load? Is
that right?
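
In case it helps, here's roughly the write pattern I'm picturing,
sketched with the Java client rather than thrift (table name from your
logs, family/qualifier names and values made up):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TwoFamilyWrite {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(HBaseConfiguration.create(), "sequence");

      // row A carries its data in family f1...
      Put a = new Put(Bytes.toBytes("rowA"));
      a.add(Bytes.toBytes("f1"), Bytes.toBytes("q"), Bytes.toBytes("value1"));

      // ...while a different row B carries its data in family f2
      Put b = new Put(Bytes.toBytes("rowB"));
      b.add(Bytes.toBytes("f2"), Bytes.toBytes("q"), Bytes.toBytes("value2"));

      table.put(a);
      table.put(b);
      table.close();
    }
  }

Please correct me if the actual pattern is different.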

Thx!

J-D

On Thu, Mar 31, 2011 at 7:15 PM, Chris Tarnas <cft@email.com> wrote:
> Thanks for your help J.D., answers inline:
>
> On Mar 31, 2011, at 8:00 PM, Jean-Daniel Cryans wrote:
>
>> I wouldn't worry too much at the moment about what seem to be double
>> deletes of blocks; I'd like to concentrate on the state of your
>> cluster first.
>>
>> So if you run hbck, do you see any inconsistencies?
>>
>
> hbck is clean, no inconsistencies, and hadoop fsck comes back clean as well.
>
>> In the datanode logs, do you see any exceptions regarding xcievers
>> (just in case)?
>>
>
> The only place it shows up is in the BlockAlreadyExistsException:
> 2011-03-29 22:55:12,726 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.24.12:50010, storageID=DS-1245200625-10.56.24.12-50010-1297226452434, infoPort=50075, ipcPort=50020):DataXceiver
> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_5380327420628627923_1036814 is valid, and cannot be written to.
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1314)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:318)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
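>
> (To be clear about what I looked for: I take the xceiver question to
> mean the dfs.datanode.max.xcievers limit, i.e. the hdfs-site.xml
> property on the datanodes that usually gets raised for HBase, along the
> lines of:
>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <!-- value commonly recommended for HBase; the stock default is much lower -->
>     <value>4096</value>
>   </property>
>
> The DataXceiver exception above is the only xceiver-related thing I can
> find in the datanode logs.)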
>
>
>
>
>
>> In the region server logs, after the split failed, do you see a
>> line that starts with "Successful rollback of failed split of..."? I
>> would hope so, as the region server would have aborted had the
>> rollback itself failed.
>>
>
> Yes, the rollback was successful.
>
>> Actually, would it be possible to see bigger logs? As much as I like
>> the deep investigation you did, I don't want to go down a single debug
>> path when something else might have happened that we never got a
>> chance to see. Very, very often, what looks like an HDFS issue turns
>> out to be a network issue (did you do those checks?).
>>
>
> I didn't see any issues with the network (no errors, dropped packets, etc).
>
> Here is more of the master log:
>
> http://pastebin.com/PZqDZ1TV
>
> Regionserver:
>
> http://pastebin.com/pCDMPbB2
>
> Datanode log; I had to snip some stuff to fit it in pastebin:
>
> http://pastebin.com/DrWzgykH
>
>
>
>> Same for the master logs, it'd be nice to look for any weirdness.
>>
>> Thx!
>>
>> J-D
>>
>> On Thu, Mar 31, 2011 at 5:03 PM, Christopher Tarnas <cft@email.com> wrote:
>>> I've been trying to track down some HBase strangeness, from what look to be
>>> lost HBase puts: in one thrift put we insert data into two different column
>>> families at different rowkeys, but only one of the rows is there. There
>>> were no errors returned to the client or in the thrift log, which is a
>>> little disturbing. This is fairly well tested code that has worked
>>> flawlessly up to this point, so I started to look for problems in the HBase
>>> and Hadoop logs. We are using CDH3b4.
>>>
>>> This happened during a large, >1 billion row load over 11 hours. In the
>>> regionserver logs it looks like they were having trouble talking with
>>> the datanodes during splits. I'd see worrying stuff like this in the
>>> regionserver log:
>>>
>>> 2011-03-29 22:55:12,946 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>> complete file
>>> /hbase/sequence/d13ef276819124e550bb6e0be9c5cdc8/splits/c6f102f1781897a0bd2025bd8252c3cd/core/9219969397457794945.d13ef276819124e550bb6e0be9c5cdc8
>>> retrying...
>>> 2011-03-29 22:55:23,628 INFO
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of
>>> failed split of
>>> sequence,MI6Q0pMyu3mQDR7hp71RBA\x093d6ae329777234c5ae9019cf8f5cfe80-A2600002-1\x09,1301367397529.d13ef276819124e550bb6e0be9c5cdc8.;
>>> Took too long to split the files and create the references, aborting split
>>> 2011-03-29 22:55:23,661 WARN org.apache.hadoop.ipc.Client: interrupted
>>> waiting to send params to server
>>> java.lang.InterruptedException
>>>        at
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
>>>
>>> More of the regionserver log here: http://pastebin.com/vnzHWXKT
>>>
>>> I saw those messages at different but close times on each of the nodes I
>>> checked. I then looked at the datanode log, and at the same time it had
>>> suspicious messages like this:
>>>
>>> 2011-03-29 22:55:12,726 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>>> DatanodeRegistration(10.56.24.12:50010, storageID=DS-1245200625-10.56.24.12-50010-1297226452434,
>>> infoPort=50075, ipcPort=50020):DataXceiver
>>> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
>>> Block blk_5380327420628627923_1036814 is valid, and cannot be written to.
>>>        at
>>> org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1314)
>>>        at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
>>>       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:318)
>>>        at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
>>>
>>>
>>> Here is more of the datanode log: http://pastebin.com/yPP6EG8k
>>>
>>> Grepping for that block in the datanode log yields this:
>>>
>>> 2011-03-29 22:54:54,742 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_5380327420628627923_1036814 src: /10.56.24.12:45205 dest: /
>>> 10.56.24.12:50010
>>> 2011-03-29 22:54:54,748 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.56.24.12:45205, dest: /10.56.24.12:50010, bytes: 81, op: HDFS_WRITE,
>>> cliID: DFSClient_-1734106091, offset: 0, srvID:
>>> DS-1245200625-10.56.24.12-50010-1297226452434, blockid:
>>> blk_5380327420628627923_1036814, duration: 1102000
>>> 2011-03-29 22:54:54,749 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
>>> blk_5380327420628627923_1036814 terminating
>>> 2011-03-29 22:55:12,582 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_5380327420628627923_1036814 src: /10.56.24.19:41159 dest: /
>>> 10.56.24.12:50010
>>> 2011-03-29 22:55:12,726 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>> blk_5380327420628627923_1036814 received exception
>>> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
>>> blk_5380327420628627923_1036814 is valid, and cannot be written to.
>>> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
>>> blk_5380327420628627923_1036814 is valid, and cannot be written to.
>>> 2011-03-29 22:55:17,380 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>> blk_5380327420628627923_1036814 file
>>> /biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
>>> for deletion
>>> 2011-03-29 22:55:17,381 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>> blk_5380327420628627923_1036814 at file
>>> /biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
>>> 2011-03-29 22:56:12,697 WARN
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to
>>> delete block blk_5380327420628627923_1036814. BlockInfo not found in
>>> volumeMap.
>>>
>>> It also looks like it is being deleted twice. So... I grepped the namenode
>>> log for the block and saw this:
>>>
>>> http://pastebin.com/FDv0vMSM
>>>
>>> From what I can tell it looks like the block is marked as over-replicated,
>>> added to the invalidates set, deleted on the node via that path, and then
>>> deleted again when the file is deleted. I then grepped for that file in the
>>> namenode log to see what that reveals:
>>>
>>> http://pastebin.com/C6mFpvWc
>>>
>>> Not sure what to make of that; I'm well out of my depth here. It does look
>>> like the file is created, deleted, then created again.
>>>
>>> Any help figuring out what happened would be great. I'm probably going to
>>> have to redo the load, but I would like to have an idea of what went wrong.
>>>
>>> thanks,
>>> -chris
>>>
>
>
