nutch-dev mailing list archives

From Michael Cafarella <michael_cafare...@comcast.net>
Subject Re: [Nutch-dev] Re: NameNode scalability
Date Tue, 08 Mar 2005 15:39:02 GMT

  Angel,

  Much of what you're seeing is part of the replication problem.

  1) The "Replicated ...." message is when a successful replication
happens.  It's not surprising that you see a lot of them.

  2) The "Block XX is valid, and cannot be written to" happens when one
node tries to replicate a block that has already been replicated.  The
bug is that we're far too aggressive with replication, so this situation
often appears.  (Precisely, the problem is that the NameNode tells far
too many datanodes to attempt replication.)
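
  To make that concrete, here is a rough sketch (in Java, but purely
illustrative -- this is not the actual NDFS code, and the class name
and limit are made up) of the kind of throttling that would help:
the NameNode stops handing out replication requests for a block once
a small number of transfers are already pending for it.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative sketch only -- not the real NDFS NameNode code.
  public class ReplicationScheduler {

      // Hypothetical cap on outstanding replication requests per block.
      private static final int MAX_PENDING_PER_BLOCK = 2;

      private final Map<String, Integer> pending = new HashMap<String, Integer>();

      // Returns true if we should ask another datanode to replicate
      // this block, false if enough requests are already in flight.
      public synchronized boolean mayScheduleReplication(String blockId) {
          int inFlight = pending.containsKey(blockId) ? pending.get(blockId) : 0;
          if (inFlight >= MAX_PENDING_PER_BLOCK) {
              return false;                          // throttled
          }
          pending.put(blockId, inFlight + 1);
          return true;
      }

      // Called when a datanode reports the transfer finished (or failed).
      public synchronized void replicationDone(String blockId) {
          Integer inFlight = pending.get(blockId);
          if (inFlight == null) {
              return;
          }
          if (inFlight <= 1) {
              pending.remove(blockId);
          } else {
              pending.put(blockId, inFlight - 1);
          }
      }

      public static void main(String[] args) {
          ReplicationScheduler s = new ReplicationScheduler();
          String blk = "blk_-9167778052227947819";
          System.out.println(s.mayScheduleReplication(blk));   // true
          System.out.println(s.mayScheduleReplication(blk));   // true
          System.out.println(s.mayScheduleReplication(blk));   // false
      }
  }

  With something like that in place, each block would trigger at most
a couple of transfer requests at a time instead of a burst aimed at
every datanode in the cluster.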

  3) The "Lost heartbeat" message can appear if a datanode is spending
all its time writing out replicate blocks instead of sending heartbeats
back to the namenode.  So it's a side effect of too much replication.
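
  Just to illustrate that failure mode (again a sketch, not the current
DataNode code, and the class name and interval are assumptions): if
heartbeats share a thread with block transfers, any long-running copy
delays them, whereas a dedicated heartbeat thread keeps reporting even
while transfers run.

  // Illustrative sketch only -- not the real NDFS DataNode code.
  public class HeartbeatSender implements Runnable {

      private static final long HEARTBEAT_INTERVAL_MS = 3000;  // assumed value

      public void run() {
          while (!Thread.currentThread().isInterrupted()) {
              try {
                  sendHeartbeat();                 // stand-in for the real IPC call
                  Thread.sleep(HEARTBEAT_INTERVAL_MS);
              } catch (InterruptedException e) {
                  return;                          // shut down cleanly
              } catch (Exception e) {
                  // log and keep going; a failed send shouldn't kill the thread
              }
          }
      }

      private void sendHeartbeat() {
          System.out.println("heartbeat sent");    // placeholder
      }

      public static void main(String[] args) throws InterruptedException {
          Thread t = new Thread(new HeartbeatSender(), "heartbeat");
          t.setDaemon(true);
          t.start();
          Thread.sleep(10000);                     // pretend to copy blocks here
      }
  }

  Even then, throttling replication is the real fix; a datanode
saturated with transfers can still fall behind on everything else.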

  The NameNode startup time is still something of a mystery, but we'll
look into it.

  Your info has been a big help.  Thanks.
  --Mike C.

On Tue, 2005-03-08 at 04:34, Angel Faus wrote:
> Hi,
> 
> Great. Thanks for the tips.
> 
> I've tried the following startup sequences:
> 
>  * Start NameNode. Wait until CPU goes to 0. Wait 2 extra minutes.
> Start all DataNodes.
>  * Start NameNode. Wait until CPU goes to 0. Wait 2 extra minutes.
> Start each DataNode with a 10-minute pause between them.
>  * Start all DataNodes. Wait 10 min. Start NameNode.
>  
> In every case, I ran into the same "Problem making IPC call".
> 
> I changed the number of threads in the NameNode to 100, without any effect.
> 
> I would say that the biggest issue is the replication of blocks. We
> are seeing tons of lines like this in the DataNode logs:
> 
>  050308 121928 Replicated block blk_-9167778052227947819 to
> vlex-cluster-6/192.168.166.121:7000
> 
> Other odd things in the DataNode logs are:
> 
>  java.io.IOException: Block blk_-9157092366090071006 is valid, and
> cannot be written to.
> (thousands of them)
> 
> And in the NameNode, we see periodic bursts of:
> 
>  050308 131028 Lost heartbeat for vlex-cluster-3:7000
>  050308 131028 Lost heartbeat for vlex-cluster-4:7000
>  050308 131029 Lost heartbeat for vlex-cluster-7:7000
>  050308 131029 Lost heartbeat for vlex-cluster-8:7000
>  050308 131030 Lost heartbeat for vlex-cluster-9:7000
>  050308 131030 Lost heartbeat for vlex-cluster-2:7000
> 
> Afterwards, of course, the remaining DataNodes try desperately to
> replicate their data:
> 
>  050308 131041 Pending transfer from vlex-cluster-5:7000 to 3 destinations
>  ...
>  
> 
> The "Lost heartbeat" error would indicate connectivity problems, but
> both "ping vlex-cluster-4" and "telnet vlex-cluster-4 7000" from the
> NameNode work consistently well.
> 
> By the way, the NameNode startup time doesn't seem related to
> replaying the log, since right now "edits" is an empty file.
> 
> Summing up: I think that the best thing I can do is wait for the patch
> that enables throttling of block replication, and rerun the tests.
> 
> Thanks again for your responsiveness,
> 
> 
> angel


