hbase-user mailing list archives

From jthie...@ina.fr
Subject Data lost during intensive writes
Date Wed, 04 Mar 2009 17:18:10 GMT
Hello,

I have been testing HBase for several weeks.
My test cluster is made of 6 low-cost machines (Dell Studio Hybrid, Core 2 Duo 2 GHz, 4 GB RAM,
320 GB HDD).

My configuration files:

hadoop-site.xml :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hadoop-tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/hadoop-dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>


<property>
  <name>fs.default.name</name>
  <value>hdfs://hephaistos:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>mapred.job.tracker</name>
  <value>hephaistos:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>dfs.block.size</name>
  <value>8388608</value>
  <description>The hbase standard size for new files.</description>
<!--<value>67108864</value>-->
<!--<description>The default block size for new files.</description>-->
</property>

<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>8192</value>
   <description>Up xcievers (see HADOOP-3831)</description>
</property>
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
  <description> Specifies the maximum bandwidth that each datanode can utilize for the
   balancing purpose in terms of the number of bytes per second. Default is 1048576</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/hadoop/hadoop-mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/hadoop/hadoop-mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/hadoop/hadoop-mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".  
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
</configuration>

hbase-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2007 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hephaistos:54310/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>hephaistos:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>
  <property>
    <name>hbase.hregion.memcache.flush.size</name>
    <value>67108864</value>
    <description>
    A HRegion memcache will be flushed to disk if size of the memcache
    exceeds this number of bytes.  Value is checked by a thread that runs
    every hbase.server.thread.wakefrequency.  
    </description>
  </property>  
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>268435456</value>
    <description>
    Maximum HStoreFile size. If any one of a column families' HStoreFiles has
    grown to exceed this value, the hosting HRegion is split in two.
    Default: 256M.
    </description>
  </property>
  <property>
    <name>hbase.io.index.interval</name>
    <value>128</value>
    <description>The interval at which we record offsets in hbase
    store files/mapfiles.  Default for stock mapfiles is 128.  Index
    files are read into memory.  If there are many of them, could prove
    a burden.  If so play with the hadoop io.map.index.skip property and
    skip every nth index member when reading back the index into memory.
    Downside to high index interval is lowered access times.
    </description>
  </property>  
  <property>
    <name>hbase.hstore.blockCache.blockSize</name>
    <value>65536</value>
    <description>The size of each block in the block cache.
    Enable blockcaching on a per column family basis; see the BLOCKCACHE setting
    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache so are
    let go when high pressure on memory.  Block caching is not enabled by default.
    Default is 16384.
    </description>
  </property>
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>240000</value>
    <description>HRegion server lease period in milliseconds. Default is
    60 seconds. Clients must report in within this period else they are
    considered dead.</description>
  </property>  
</configuration>

My main application of HBase is to build access indexes for a web archive.
My test archive contains 160 million (160×10^6) objects that I insert into an HBase instance.
Each row contains about a thousand bytes.
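
For reference, a load like this would typically be the standard BatchUpdate loop of the
0.19-era client API. Below is a simplified, illustrative sketch only, not my actual code:
the table name and the "content" column family are taken from the logs further down, while
the dummy keys stand in for my real row keys ("r:<url>@<timestamps>", visible in the region
name in the second log excerpt).

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class ArchiveLoader {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();  // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "metadata_table");   // table name as it appears in the logs below
    byte[] value = new byte[1000];                       // roughly 1 KB per row, like the real records
    for (long i = 0; i < 160000000L; i++) {              // dummy keys; the real keys are archive URLs + timestamps
      BatchUpdate update = new BatchUpdate("r:row-" + i);
      update.put("content:data", value);                 // "content" is the column family seen in the logs
      table.commit(update);                              // one commit (one RPC) per row
    }
  }
}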

During these batch insertions I see some exceptions related to DataXceiver:

Case 1:

On HBase Regionserver:

2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

	at org.apache.hadoop.ipc.Client.call(Client.java:696)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
	at $Proxy1.addBlock(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy1.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)


On Hadoop Datanode:

2009-02-27 04:22:58,110 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010,
storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Got
exception while serving blk_5465578316105624003_26301 to /10.1.188.249:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready
for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
	at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
	at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
	at java.lang.Thread.run(Thread.java:619)

2009-02-27 04:22:58,110 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010,
storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready
for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
	at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
	at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
	at java.lang.Thread.run(Thread.java:619)

Case 2:

HBase Regionserver:

2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for
block blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895
bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895
in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for
block blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914
bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914
in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_8693483996243654850_96912java.io.IOException: Bad response 1 for
block blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912
bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912
in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for
block blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931
bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931
in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_7417692418733608681_96934java.io.IOException: Bad response 1 for
block blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934
bad datanode[2] 10.1.188.182:50010
2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934
in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_6777180223564108728_96939java.io.IOException: Bad response 1 for
block blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939
bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939
in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for
block blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948
bad datanode[2] 10.1.188.182:50010
2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948
in pipeline 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_SPLIT:
metadata_table,r:http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for
block blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956
bad datanode[2] 10.1.188.182:50010
2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956
in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_1004039574836775403_96959java.io.IOException: Bad response 1 for
block blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1004039574836775403_96959
bad datanode[1] 10.1.188.182:50010


Hadoop datanode:

2009-03-02 09:55:10,201 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_-5472632607337755080_96875 1 Exception java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readLong(DataInputStream.java:399)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
	at java.lang.Thread.run(Thread.java:619)

2009-03-02 09:55:10,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
1 for block blk_-5472632607337755080_96875 terminating
2009-03-02 09:55:10,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010,
storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Exception
writing block blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
	at sun.nio.ch.IOUtil.write(IOUtil.java:75)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
	at java.lang.Thread.run(Thread.java:619)

2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5472632607337755080_96875
received exception java.io.IOException: Broken pipe
2009-03-02 09:55:10,517 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010,
storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
	at sun.nio.ch.IOUtil.write(IOUtil.java:75)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
	at java.lang.Thread.run(Thread.java:619)
2009-03-02 09:55:11,174 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE, cliID: DFSClient_1091437257,
srvID: DS-1180278657-127.0.0.1-50010-1235652659245, blockid: blk_5027345212081735473_96878
2009-03-02 09:55:11,177 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
2 for block blk_5027345212081735473_96878 terminating
2009-03-02 09:55:11,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /10.1.188.249:50010
2009-03-02 09:55:11,186 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /10.1.188.249:50010
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_8782629414415941143_96845 java.io.IOException: Connection reset
by peer
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_8782629414415941143_96845 Interrupted.
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_8782629414415941143_96845 terminating
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_8782629414415941143_96845
received exception java.io.IOException: Connection reset by peer
2009-03-02 09:55:11,187 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010,
storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
	at sun.nio.ch.IOUtil.read(IOUtil.java:206)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at java.io.DataInputStream.read(DataInputStream.java:132)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
	at java.lang.Thread.run(Thread.java:619)
        etc.............................

I have other exceptions related to DataXceiver problems. These errors do not make the region
servers go down, but I can see that I lost some records (about 3 million out of 160 million).
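
For reference, a row count over the table is one way to see the gap. Below is a minimal
counting sketch, assuming the 0.19-era scanner API (HTable, Scanner, RowResult); at this
scale I believe the bundled RowCounter MapReduce job would be the more practical tool.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class CountRows {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "metadata_table");
    // Restrict the scan to a single column family to keep it as cheap as possible.
    Scanner scanner = table.getScanner(new byte[][] { Bytes.toBytes("content:") });
    long count = 0;
    for (RowResult row : scanner) {  // Scanner is Iterable<RowResult> in this API
      count++;
    }
    scanner.close();
    System.out.println("rows in metadata_table: " + count);
  }
}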

As you can see in my conf files, I raised dfs.datanode.max.xcievers to 8192, as suggested in
several mails, and my ulimit -n is 32768.

Do these problems come from my configuration or from my hardware?

Jérôme Thièvre




