nutch-dev mailing list archives

From Angel Faus <angel.f...@gmail.com>
Subject Re: [Nutch-dev] Re: NameNode scalibility
Date Tue, 08 Mar 2005 12:34:52 GMT
Hi,

Great. Thanks for the tips.

I've tried the following startup sequences:

 * Start NameNode. Wait until CPU goes to 0. Wait 2 extra minutes.
Start all DataNodes.
 * Start NameNode. Wait until CPU goes to 0. Wait 2 extra minutes.
Start each DataNode with a 10-minute pause between them.
 * Start all DataNodes. Wait 10 min. Start NameNode.
 
In every case, I ran into the same "Problem making IPC call".

I changed the number of threads in the NameNode to 100, without any effect.

I would say that the biggest issue is the replication of blocks. We
are seeing tons of lines like this in the DataNode logs:

 050308 121928 Replicated block blk_-9167778052227947819 to
vlex-cluster-6/192.168.166.121:7000

Other odd things in the DataNode logs are:

 java.io.IOException: Block blk_-9157092366090071006 is valid, and
cannot be written to.
(thousands of them)

And in the NameNode, we see periodic bursts of:

 050308 131028 Lost heartbeat for vlex-cluster-3:7000
 050308 131028 Lost heartbeat for vlex-cluster-4:7000
 050308 131029 Lost heartbeat for vlex-cluster-7:7000
 050308 131029 Lost heartbeat for vlex-cluster-8:7000
 050308 131030 Lost heartbeat for vlex-cluster-9:7000
 050308 131030 Lost heartbeat for vlex-cluster-2:7000
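
To see how these bursts are distributed, the lines can be tallied per node and per minute. This is only a minimal sketch, assuming every line has exactly the "date time Lost heartbeat for host:port" layout shown above; the function name is hypothetical and not part of Nutch:

```python
import re
from collections import Counter

# Matches lines such as: " 050308 131028 Lost heartbeat for vlex-cluster-3:7000"
# (assumed layout: yymmdd, hhmmss, message, host:port)
LOST_RE = re.compile(r"^\s*(\d{6}) (\d{6}) Lost heartbeat for (\S+)$")


def tally_lost_heartbeats(lines):
    """Count 'Lost heartbeat' lines per node and per (date, hhmm) minute."""
    per_node = Counter()
    per_minute = Counter()
    for line in lines:
        match = LOST_RE.match(line.rstrip())
        if match is None:
            continue  # not a lost-heartbeat line; ignore it
        date, time, node = match.groups()
        per_node[node] += 1
        per_minute[(date, time[:4])] += 1
    return per_node, per_minute
```

Feeding it the NameNode log (for example, `tally_lost_heartbeats(open(path))`, with `path` pointing at wherever the log lives) gives a quick view of whether the same nodes drop out in every burst.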

Afterwards, of course, the remaining DataNodes try desperately to
replicate their data:

 050308 131041 Pending transfer from vlex-cluster-5:7000 to 3 destinations
 ...
 

The "Lost heartbeat" error would indicate connectivity problems, but
both "ping vlex-cluster-4" and "telnet vlex-cluster-4 7000" from the
NameNode work consistently well.
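
The telnet check can be repeated in a script. The sketch below assumes nothing beyond a plain TCP connect, and `can_connect` is a hypothetical helper name. Note that a successful connect only shows that the port accepts connections; it says nothing about whether the heartbeat itself is answered in time:

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name resolution failures
        return False
```

Calling `can_connect("vlex-cluster-4", 7000)` in a loop from the NameNode during one of the bursts would show whether connects start failing at the same moment.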

By the way, the NameNode startup time doesn't seem related to
replaying the log, since right now "edits" is an empty file.

Summing up: I think that the best thing I can do is wait for the patch
that enables throttling of block replication, and then rerun the tests.

Thanks again for your responsiveness,


angel
