hbase-user mailing list archives

From Stanley Xu <wenhao...@gmail.com>
Subject Re: Error of "Got error in response to OP_READ_BLOCK for file"
Date Wed, 11 May 2011 16:06:51 GMT
Dear all,

I just checked our logs today and found the following entries:


2011-05-11 16:46:06,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7212216405058183301_3974453 src: /10.0.2.39:60393 dest: /10.0.2.39:50010
2011-05-11 16:46:14,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:60393, dest: /10.0.2.39:50010, bytes: 83774037, op: HDFS_WRITE, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 16:46:14,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_7212216405058183301_3974453 terminating
2011-05-11 16:46:14,764 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.39:60395, bytes: 89, op: HDFS_READ, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 16:46:14,764 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.39:60396, bytes: 84197, op: HDFS_READ, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:33:50,189 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.26:52069, bytes: 89, op: HDFS_READ, cliID: DFSClient_1460045357, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:33:50,193 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.26:52070, bytes: 84197, op: HDFS_READ, cliID: DFSClient_1460045357, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:56:48,922 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.39:48272, bytes: 84428525, op: HDFS_READ, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:57:04,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_7212216405058183301_3974453 file /hadoop/dfs/data/current/subdir3/subdir10/blk_7212216405058183301
2011-05-11 19:04:54,971 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.39:50010, storageID=DS-1901535396-192.168.11.112-50010-1285486752139, infoPort=50075, ipcPort=50020):Got exception while serving blk_7212216405058183301_3974453 to /10.0.2.26:
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.
2011-05-11 20:25:14,600 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.39:50010, storageID=DS-1901535396-192.168.11.112-50010-1285486752139, infoPort=50075, ipcPort=50020):Got exception while serving blk_7212216405058183301_3974453 to /10.0.2.26:
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.


It looks like the DataNode first deleted the block and then tried to serve it in response to a RegionServer request. Should I assume this is the corruption at the .META. level that you described?
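
To double-check that pattern across the whole log, I put together a minimal Java sketch of my own (the class name and the datanode.log path are just placeholders) that records every block id seen in a "Deleting block" line and flags later "is not valid" lines for the same block:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Walk a DataNode log and flag blocks that were deleted earlier in the log
// and are later reported as "is not valid" when a client asks for them again.
public class DeletedBlockCheck {
  private static final Pattern BLOCK_ID = Pattern.compile("blk_-?\\d+");

  public static void main(String[] args) throws Exception {
    String logPath = args.length > 0 ? args[0] : "datanode.log"; // placeholder path
    Set<String> deleted = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(logPath));
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = BLOCK_ID.matcher(line);
      if (!m.find()) {
        continue;
      }
      String block = m.group();
      if (line.contains("Deleting block")) {
        deleted.add(block);
      } else if (line.contains("is not valid") && deleted.contains(block)) {
        System.out.println("Block served after deletion: " + block);
        System.out.println("  " + line);
      }
    }
    in.close();
  }
}

Run over the DataNode log above, it should flag blk_7212216405058183301 twice, once for each of the 19:04 and 20:25 exceptions.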

And if we wanted to upgrade to the 0.20-append branch, are there any changes at the infrastructure level, such as a file system format change, that we should be aware of? Could I just create a build from the 0.20-append branch, replace the jars on the cluster, and restart the servers?

Thanks in advance.

Best wishes,
Stanley Xu



On Wed, May 11, 2011 at 12:50 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Data cannot be corrupted at all, since the files in HDFS are immutable
> and CRC'ed (unless you are able to lose all 3 copies of every block).
>
> Corruption would happen at the metadata level, where the .META.
> table, which contains the regions for the tables, would lose rows. This
> is a likely scenario if the region server holding that region dies because
> of GC, since the hadoop version you are using along with hbase 0.20.6 doesn't
> support appends, meaning that the write-ahead log would be missing
> data that, obviously, cannot be replayed.
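
(A minimal sketch of how such missing rows might be eyeballed, assuming the 0.20.x client API behaves as I recall: scan .META. for info:regioninfo and print each region's table, start key, and end key, so gaps in a table's key chain stand out. The class name is mine and method names may differ slightly between versions.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

// Dump every region row in .META. so that holes in a table's
// start/end key chain can be spotted by eye.
public class MetaDump {
  public static void main(String[] args) throws Exception {
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("info"));
    ResultScanner scanner = meta.getScanner(scan);
    for (Result r : scanner) {
      byte[] bytes = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
      if (bytes == null) {
        continue; // row without a serialized HRegionInfo
      }
      HRegionInfo info = Writables.getHRegionInfo(bytes);
      System.out.println(info.getTableDesc().getNameAsString()
          + "  start=" + Bytes.toString(info.getStartKey())
          + "  end=" + Bytes.toString(info.getEndKey()));
    }
    scanner.close();
  }
}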
>
> The best advice I can give you is to upgrade.
>
> J-D
>
> On Tue, May 10, 2011 at 5:44 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
> > Thanks J-D. I am a little more confused now: it seems that when we have a
> > corrupt hbase table or some inconsistent data, we get lots of messages like
> > that. But even when the hbase table is fine, we still see a few lines of
> > messages like that.
> >
> > How can I tell whether they come from data corruption or just from a
> > harmless occurrence of the scenario you mentioned?
> >
> >
> >
> > On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Very often the "cannot open filename" happens when the region in
> >> question was reopened somewhere else and that region was compacted. As
> >> to why it was reassigned, most of the time it's because of garbage
> >> collections taking too long. The master log should have all the
> >> required evidence, and the region server should print some "slept for
> >> Xms" (where X is some number of ms) messages before everything goes
> >> bad.
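
(On top of checking the master and region server logs, here is a small sketch, assuming HTable.getRegionsInfo() in the 0.20.x client works as I recall, that prints where each region of a table is currently assigned, so you can see whether the region from the error has indeed been reopened on another region server. "users" is the table from the errors quoted further down in this thread, and the class name is mine.)

import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.HTable;

// Print where each region of a table currently lives, to check whether a
// region named in an error has been reopened on another region server.
public class RegionLocations {
  public static void main(String[] args) throws Exception {
    String tableName = args.length > 0 ? args[0] : "users"; // table from the errors in this thread
    HTable table = new HTable(new HBaseConfiguration(), tableName);
    Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
    for (Map.Entry<HRegionInfo, HServerAddress> e : regions.entrySet()) {
      System.out.println(e.getKey().getRegionNameAsString() + " -> " + e.getValue());
    }
  }
}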
> >>
> >> Here are some general tips on debugging problems in HBase
> >> http://hbase.apache.org/book/trouble.html
> >>
> >> J-D
> >>
> >> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
> >> > Dear all,
> >> >
> >> > We have been using HBase 0.20.6 in our environment, and it has been pretty
> >> > stable over the last couple of months, but we have run into some reliability
> >> > issues since last week. Our situation is very similar to the one described in
> >> > the following link.
> >> >
> >>
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
> >> >
> >> > When we use an hbase client to connect to the hbase table, it appears to get
> >> > stuck. And we find logs like the following:
> >> >
> >> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849
> >> >
> >> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=data, columns=ALL}) from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169
> >> > java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169
> >> >
> >> >
> >> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075, ipcPort=50020):
> >> > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> >> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> >> >
> >> > on the server side.
> >> >
> >> > And if we do a flush and then a major compaction on the ".META." table, the
> >> > problem goes away, but it reappears some time later.
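
(That flush plus major compaction can also be scripted rather than run by hand; a rough sketch, assuming the 0.20.x HBaseAdmin flush() and majorCompact() calls accept a table name as I recall. It is just the programmatic equivalent of doing the same thing from the shell each time.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Trigger the flush + major compaction of .META. described above,
// so the workaround can be scripted instead of run by hand.
public class CompactMeta {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    admin.flush(".META.");        // flush the catalog table first
    admin.majorCompact(".META."); // then ask for a major compaction
  }
}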
> >> >
> >> > At first we guessed it might be an xceiver problem, so we set the xceiver limit
> >> > to 4096 as described in the following link:
> >> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
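
(To confirm the new limit is actually the one the datanodes pick up, a small sketch that reads the value back through Hadoop's Configuration class; the hdfs-site.xml path is a placeholder and the class name is mine.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Read back the xceiver limit from a given hdfs-site.xml to confirm the
// 4096 setting is really the one the datanodes will pick up.
public class XceiverCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml")); // placeholder path
    // 256 was the shipped default for this property, if I remember right.
    System.out.println("dfs.datanode.max.xcievers = "
        + conf.getInt("dfs.datanode.max.xcievers", 256));
  }
}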
> >> >
> >> > But we still get the same problem. It looks like a restart of the whole HBase
> >> > cluster fixes the problem for a while, but we cannot keep restarting the
> >> > servers forever.
> >> >
> >> > I am waiting online and will really appreciate any reply.
> >> >
> >> >
> >> > Best wishes,
> >> > Stanley Xu
> >> >
> >>
> >
>
