hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Data lost during intensive writes
Date Fri, 06 Mar 2009 19:15:34 GMT
On Wed, Mar 4, 2009 at 9:18 AM, <jthievre@ina.fr> wrote:

> <property>
>  <name>dfs.replication</name>
>  <value>2</value>
>  <description>Default block replication.
>  The actual number of replications can be specified when the file is
> created.
>  The default is used if replication is not specified in create time.
>  </description>
> </property>
>
> <property>
>  <name>dfs.block.size</name>
>  <value>8388608</value>
>  <description>The hbase standard size for new files.</description>
> <!--<value>67108864</value>-->
> <!--<description>The default block size for new files.</description>-->
> </property>
>


The above are non-standard.  A replication of 3 might lessen the incidence
of the HDFS errors you are seeing, since there would be another replica to go
to.   Why the non-standard block size?
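
To go back to the stock values, something like the below in your
hadoop-site.xml would do it (just a sketch; 3 and 64MB are the shipped
defaults, the latter being the 67108864 you have commented out above):

<property>
 <name>dfs.replication</name>
 <value>3</value>
 <description>Default block replication (stock default).</description>
</property>

<property>
 <name>dfs.block.size</name>
 <value>67108864</value>
 <description>The default block size for new files (64MB).</description>
</property>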

I did not see *dfs.datanode.socket.write.timeout* set to 0.  Is that because
you are running 0.19.0?  You might try setting it, especially since in the
logs below I see complaints about that timeout (but more on this below).



>  <property>
>    <name>hbase.hstore.blockCache.blockSize</name>
>    <value>65536</value>
>    <description>The size of each block in the block cache.
>    Enable blockcaching on a per column family basis; see the BLOCKCACHE
> setting
>    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache so
> are
>    let go when high pressure on memory.  Block caching is not enabled by
> default.
>    Default is 16384.
>    </description>
>  </property>
>


Are you using blockcaching?  If so, a 64k block size was problematic in my
testing (it caused OOMEs).
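
If you keep blockcaching on, you might try dropping back to the 16k default
noted in the description you quote above; i.e. something like (sketch only):

<property>
 <name>hbase.hstore.blockCache.blockSize</name>
 <value>16384</value>
 <description>The size of each block in the block cache (default).</description>
</property>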




> Case 1:
>
> On HBase Regionserver:
>
> 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
>        at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
>        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:696)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.addBlock(Unknown Source)
>        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>        at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>        at $Proxy1.addBlock(Unknown Source)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
>
> On Hadoop Datanode:
>
> 2009-02-27 04:22:58,110 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):Got exception while serving
> blk_5465578316105624003_26301 to /10.1.188.249:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010remote=/
> 10.1.188.249:48326]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>        at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>        at java.lang.Thread.run(Thread.java:619)
>
> 2009-02-27 04:22:58,110 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010remote=/
> 10.1.188.249:48326]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>        at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>        at java.lang.Thread.run(Thread.java:619)


Are you sure the regionserver error matches the datanode error?

My understanding is that in 0.19.0 the DFSClient in the regionserver is
supposed to reestablish timed-out connections.  If that is not happening in
your case -- and we've speculated that there might be holes in this mechanism
-- try with the timeout set to zero (see above).  Be sure the configuration
can be seen by the DFSClient running inside hbase, either by adding it to
hbase-site.xml or by getting hadoop-site.xml onto the hbase CLASSPATH
(hbase-env.sh#HBASE_CLASSPATH, or a symlink into the HBASE_HOME/conf dir).
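
For example, something along these lines in hbase-site.xml so the DFSClient
inside the regionserver picks it up (untested sketch):

<property>
 <name>dfs.datanode.socket.write.timeout</name>
 <value>0</value>
 <description>Zero disables the datanode socket write timeout; must be
 visible to the DFSClient running inside the regionserver.</description>
</property>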



> Case 2:
>
> HBase Regionserver:
>
> 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for block
> blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6496095407839777264_96895 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6496095407839777264_96895 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for block
> blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-7585241287138805906_96914 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-7585241287138805906_96914 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_8693483996243654850_96912java.io.IOException: Bad response 1 for block
> blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_8693483996243654850_96912 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_8693483996243654850_96912 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for block
> blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8939308025013258259_96931 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8939308025013258259_96931 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_7417692418733608681_96934java.io.IOException: Bad response 1 for block
> blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7417692418733608681_96934 bad datanode[2]
> 10.1.188.182:50010
> 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7417692418733608681_96934 in pipeline
> 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_6777180223564108728_96939java.io.IOException: Bad response 1 for block
> blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_6777180223564108728_96939 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_6777180223564108728_96939 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for block
> blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6352908575431276531_96948 bad datanode[2]
> 10.1.188.182:50010
> 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6352908575431276531_96948 in pipeline
> 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,988 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> MSG_REGION_SPLIT: metadata_table,r:
> http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
> 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> ResponseProcessor exception  for block
> blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for block
> blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-1071965721931053111_96956 bad datanode[2]
> 10.1.188.182:50010
> 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-1071965721931053111_96956 in pipeline
> 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_1004039574836775403_96959java.io.IOException: Bad response 1 for block
> blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1004039574836775403_96959 bad datanode[1]
> 10.1.188.182:50010
>
>
> Hadoop datanode:
>
> 2009-03-02 09:55:10,201 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-5472632607337755080_96875 1 Exception java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at java.io.DataInputStream.readLong(DataInputStream.java:399)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
>        at java.lang.Thread.run(Thread.java:619)
>
> 2009-03-02 09:55:10,407 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block
> blk_-5472632607337755080_96875 terminating
> 2009-03-02 09:55:10,516 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):Exception writing block
> blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
> java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
>
> 2009-03-02 09:55:10,517 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
> 2009-03-02 09:55:10,517 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-5472632607337755080_96875 received exception java.io.IOException:
> Broken pipe
> 2009-03-02 09:55:10,517 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-03-02 09:55:11,174 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE,
> cliID: DFSClient_1091437257, srvID:
> DS-1180278657-127.0.0.1-50010-1235652659245, blockid:
> blk_5027345212081735473_96878
> 2009-03-02 09:55:11,177 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
> blk_5027345212081735473_96878 terminating
> 2009-03-02 09:55:11,185 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /
> 10.1.188.249:50010
> 2009-03-02 09:55:11,186 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /
> 10.1.188.249:50010
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_8782629414415941143_96845 java.io.IOException: Connection
> reset by peer
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block
> blk_8782629414415941143_96845 Interrupted.
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block
> blk_8782629414415941143_96845 terminating
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_8782629414415941143_96845 received exception java.io.IOException:
> Connection reset by peer
> 2009-03-02 09:55:11,187 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Connection reset by peer
>        at sun.nio.ch.FileDispatcher.read0(Native Method)
>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>        at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>        at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>        at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>        at java.io.DataInputStream.read(DataInputStream.java:132)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
>        etc.............................



This looks like an HDFS issue where it won't move on past the bad server
.182.  On the client side these are reported as WARNs in the DFSClient but
they don't make it up to the regionserver, so there is not much we can do
about them.


> I have other exceptions related to DataXceiver problems. These errors
> don't make the region server go down, but I can see that I have lost some
> records (about 3.10e6 out of 160.10e6).
>


Any regionserver crashes during your upload?  I'd think that a more likely
reason for data loss; i.e. edits that were in memcache didn't make it out to
the filesystem because there is still no working flush in HDFS -- hopefully
in Hadoop 0.21; see HADOOP-4379 (though your case 2 above looks like we could
have handed HDFS the data and it dropped it anyway).



>
> As you can see in my conf files, I upped dfs.datanode.max.xcievers to 8192
> as suggested in several mails.
> And my ulimit -n is at 32768.


Make sure the above is actually in effect by checking the head of your
regionserver log on startup.
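
For reference, the xceiver bump has to be in the hadoop-site.xml your
datanodes read; roughly (sketch, using the 8192 you mention):

<property>
 <name>dfs.datanode.max.xcievers</name>
 <value>8192</value>
 <description>Upper bound on the number of DataXceiver threads a
 datanode will run concurrently.</description>
</property>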



> Do these problems come from my configuration, or my hardware ?
>


Let's do some more back and forth and make sure we have done all we can
regarding the software configuration.  It's probably not hardware, going by
the above.

Tell us more about your uploading process and your schema.  Did everything
load?  If so, how many regions across your 6 servers?  How did you verify how
much was loaded?

St.Ack
