hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: HBase strangeness and double deletes of HDFS blocks and writing to closed blocks
Date Fri, 01 Apr 2011 01:00:04 GMT
I wouldn't worry too much at the moment for what seems to be double
deletes of blocks, I'd like to concentrate on the state of your
cluster first.

So if you run hbck, do you see any inconsistencies?
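For reference, the consistency check can be run from the shell on a cluster node; a minimal sketch, assuming the `hbase` CLI of this era (where `hbck` scans .META. against the region servers and reports inconsistencies):

```shell
# Sketch: run HBase's consistency checker (hbck) if the CLI is available.
# On a healthy cluster it should report 0 inconsistencies.
if command -v hbase >/dev/null 2>&1; then
  hbase hbck
else
  echo "hbase CLI not on PATH; run this on a cluster node"
fi
```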

In the datanode logs, do you see any exceptions regarding xcievers
(just in case)?
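A quick way to scan for that is a case-insensitive grep over each datanode's logs. A sketch with a fabricated sample file (the path and the exact exception text are assumptions, though datanode messages of this vintage do mention xceiver counts):

```shell
# Fabricated datanode log fragment to demonstrate the grep
# (real logs live under the Hadoop log directory on each datanode).
cat > /tmp/datanode-sample.log <<'EOF'
2011-03-29 22:55:01,100 ERROR org.apache.hadoop.hdfs.server.datanode.DataXceiver: java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent xcievers 2048
2011-03-29 22:55:12,726 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: BlockAlreadyExistsException: Block blk_... is valid, and cannot be written to.
EOF

# Match both spellings: "xceiver" appears in class names, while the
# config key dfs.datanode.max.xcievers uses the "xciever" ordering.
grep -cEi 'xceiver|xciever' /tmp/datanode-sample.log   # prints 1 here
```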

In the region server logs, after the split failed, do you see a
line that starts with "Successful rollback of failed split of..."? I
would hope so, as the region server would have aborted if the
rollback had failed.
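That check can be done mechanically by grepping for the rollback messages quoted in this thread. A sketch against a fabricated log file (the path and the second, "Successful" line are assumptions; the "Running rollback" line is taken from the logs below):

```shell
# Fabricated regionserver log fragment with the two rollback messages.
cat > /tmp/regionserver-sample.log <<'EOF'
2011-03-29 22:55:23,628 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of failed split of sequence,...; Took too long to split the files and create the references, aborting split
2011-03-29 22:55:24,101 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback of failed split of sequence,...
EOF

# If every "Running rollback" has a matching "Successful rollback",
# the region survived the failed split instead of aborting the server.
grep -c 'Successful rollback of failed split' /tmp/regionserver-sample.log
```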

Actually, would it be possible to see bigger logs? As much as I like
the deep investigation you did, I don't want to go down a single debug
path when something else might have happened that we never got a
chance to see. Very often, what looks like an HDFS issue turns out to
be a network issue (did you do those checks?).
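Those network checks can start simply, e.g. by looking at per-interface error and drop counters on each node; a minimal Linux-only sketch (column positions assumed from the standard /proc/net/dev kernel format):

```shell
# Print rx/tx error and drop counters for each interface; nonzero,
# growing numbers here point at the network rather than HDFS.
# gsub turns "eth0:12345" into "eth0 12345" so fields split cleanly.
awk 'NR > 2 {
  gsub(/:/, " ")
  printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
}' /proc/net/dev
```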

Same for the master logs; it'd be nice to look for any weirdness.



On Thu, Mar 31, 2011 at 5:03 PM, Christopher Tarnas <cft@email.com> wrote:
> I've been trying to track down some hbase strangeness from what looks to be
> lost hbase puts: in one thrift put we insert data into two different column
> families at different rowkeys, but only one of the rows is there.  There
> were no errors to the client or the thrift log, which is a little
> disturbing. This is fairly well tested code that has worked flawlessly up to
> this point so I started to look for problems in the hbase and hadoop logs.
> We are using CDH3b4.
> This happened during a large >1 billion row load over 11 hours. In the
> regionserver logs it looks like they were having trouble talking with
> the datanodes during splits. I'd see worrying stuff like this in the
> regionserver log:
> 2011-03-29 22:55:12,946 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
> /hbase/sequence/d13ef276819124e550bb6e0be9c5cdc8/splits/c6f102f1781897a0bd2025bd8252c3cd/core/9219969397457794945.d13ef276819124e550bb6e0be9c5cdc8
> retrying...
> 2011-03-29 22:55:23,628 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of
> failed split of
> sequence,MI6Q0pMyu3mQDR7hp71RBA\x093d6ae329777234c5ae9019cf8f5cfe80-A2600002-1\x09,1301367397529.d13ef276819124e550bb6e0be9c5cdc8.;
> Took too long to split the files and create the references, aborting split
> 2011-03-29 22:55:23,661 WARN org.apache.hadoop.ipc.Client: interrupted
> waiting to send params to server
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
> More of the regionserver log here: http://pastebin.com/vnzHWXKT
> I saw those messages at different but close times in each of the nodes I
> checked. I then looked at the datanode log, and at the same time it had
> suspicious messages like this:
> 2011-03-29 22:55:12,726 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>, storageID=DS-1245200625-,
> infoPort=50075,
> ipcPort=50020):DataXceiverorg.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
> Block blk_5380327420628627923_1036814 is valid, and cannot be written to.
>        at
> org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1314)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
>       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:318)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> Here is more of the datanode log: http://pastebin.com/yPP6EG8k
> Grepping for that block in the datanode log yields this:
> 2011-03-29 22:54:54,742 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_5380327420628627923_1036814 src: / dest: /
> 2011-03-29 22:54:54,748 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>, dest: /, bytes: 81, op: HDFS_WRITE,
> cliID: DFSClient_-1734106091, offset: 0, srvID:
> DS-1245200625-, blockid:
> blk_5380327420628627923_1036814, duration: 1102000
> 2011-03-29 22:54:54,749 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
> blk_5380327420628627923_1036814 terminating
> 2011-03-29 22:55:12,582 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_5380327420628627923_1036814 src: / dest: /
> 2011-03-29 22:55:12,726 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_5380327420628627923_1036814 received exception
> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
> blk_5380327420628627923_1036814 is valid, and cannot be written to.
> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
> blk_5380327420628627923_1036814 is valid, and cannot be written to.
> 2011-03-29 22:55:17,380 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> blk_5380327420628627923_1036814 file
> /biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
> for deletion
> 2011-03-29 22:55:17,381 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> blk_5380327420628627923_1036814 at file
> /biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
> 2011-03-29 22:56:12,697 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to
> delete block blk_5380327420628627923_1036814. BlockInfo not found in
> volumeMap.
> It also looks like it is being deleted twice. So... I grepped the namenode
> log for the block and saw this:
> http://pastebin.com/FDv0vMSM
> From what I can tell it looks like the block is over-replicated, added to
> InvalidStates, deleted on the node via that, then deleted again when the
> file is deleted. I then grepped for that file in the namenode log to see
> what that reveals:
> http://pastebin.com/C6mFpvWc
> Not sure what to make of that; I'm well out of my depth. It does look like
> the file is created, deleted, then created again.
> Any help figuring out what happened would be great. I'm probably going to
> have to redo the load but I would like to have an idea on what went wrong.
> thanks,
> -chris
