nutch-dev mailing list archives

From "webmaster" <>
Subject Fw: Re: near-term plan
Date Fri, 05 Aug 2005 11:31:08 GMT

---------- Forwarded Message -----------
From: "webmaster" <>
Sent: Thu, 4 Aug 2005 19:42:53 -0500
Subject: Re: near-term plan

I was using a nightly build that Piotr had given me, nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of that nature). I tested it on the Windows platform with 5 machines running it: 2 quad P3 Xeons at 100 Mbit, 1 Pentium 4 3 GHz with hyperthreading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+. All have 1 GB or more of RAM. Now I have my big server, and if you have worked on NDFS since the beginning of July I'll test it again; my big server's HD array is very fast, 200+ MB/s, so it will be able to saturate gigabit better.

Anyway, the P4 and the 2 AMD machines are hooked into the switch at gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit, but it has a gigabit uplink to my gigabit switch. So both Xeons would constantly be saturated at 11 MB/s, while the P4 was able to reach higher speeds of 50-60 MB/s with its internal RAID 0 array (dual 120 GB drives).

My main PC (Athlon 64 3500+) was the namenode, a datanode, and also the NDFS client. I could not get Nutch to work properly with NDFS. It was set up correctly, and it "kinda" worked, but it would crash the namenode when I tried to fetch segments in the NDFS filesystem, or index them, or do much of anything. So I copied all my segment directories, indexes, content, whatever it was (1.8 GB), and some DVD images onto NDFS. My primary machine runs Nutch off 10,000 RPM disks in RAID 0 (2x36 GB Raptors); they can output about 120 MB/s sustained.

So here is what I found out (in Windows): if I don't start a datanode on the namenode, with the conf pointing to instead of its outside IP, the namenode will not copy data to the other machines. If instead I'm running a datanode on the namenode, data will replicate from that datanode to the other 3 datanodes. I tried a hundred ways to make it work with an independent namenode, without luck.

The way I saw data go across my network: I would put data into NDFS, the namenode would request a datanode, find the internal datanode, and copy data to it only. After that, while the datanode was still copying data from my other HDs into chunks on the RAID array, it would replicate to the P4 via gigabit at 50-60 MB/s, and then it would replicate from the P4 to the Xeons, kind of alternating between them. I only had replication at the default of 2, and I had about 100 GB to copy in, so the copy onto the internal RAID array would finish fairly quickly. Then it finished replication to the P4, and the Xeons got a little bit of data, but not nearly as much as the P4. My guess is it only needs 2 copies: the first copy was the datanode on the internal machine, the second was the P4 datanode. The Xeons only had a smaller connection, so they didn't receive as many chunks as fast as the P4 could, and the P4 had enough space for all the data, so it worked out. I should have set replication to 4.

The AMD Athlon XP 1900+ was running Linux (SUSE 9.3), and it would crash the namenode on Windows if I connected it as a datanode, so that one didn't get tested. But I was able to put out 50-60 MB/s to 1 machine; it would not replicate data to multiple machines at the same time, it seemed. I would have thought it would output to the Xeons at the same time as the P4 (give the Xeons 20% of the data and the P4 80%, or something of that nature), but it could be that they just aren't fast enough to request data before the P4 was receiving its 32 MB chunks every 1/2 second?
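The throughput figures above can be sanity-checked with quick arithmetic (all numbers are the ones quoted in this message; nothing here touches NDFS itself):

```python
# Back-of-the-envelope check of the throughput figures in the message.

def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert a link speed in Mbit/s to MB/s (8 bits per byte)."""
    return mbit_per_s / 8

fast_ethernet = mbit_to_mbyte(100)    # 12.5 MB/s theoretical
gigabit = mbit_to_mbyte(1000)         # 125.0 MB/s theoretical

# The Xeons saturated at ~11 MB/s, which is consistent with a 100 Mbit
# link minus protocol overhead.
assert 11 <= fast_ethernet <= 13

# One 32 MB chunk every half second is 64 MB/s, so the observed
# 50-60 MB/s to the P4 is roughly disk-bound (its RAID 0 array),
# not limited by the gigabit link.
chunk_rate = 32 / 0.5
print(chunk_rate)   # 64.0
print(gigabit)      # 125.0
```

In other words, the Xeons were pinned at their 100 Mbit links, while the P4's chunk rate was close to what its internal RAID 0 could absorb, well under gigabit capacity.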
The good news: CPU usage was only at 50% on my AMD 3500+, and that was while it was copying data to the internal datanode from the NDFS client off another internal HD, running the namenode, and running the datanode internally. Does it now work with a separate namenode? I'm getting ready to run Nutch in Linux full time, if I can ever get the damn driver for my HighPoint 2220 RAID card to work with SUSE, any SUSE; the drivers don't work with dual-core CPUs or something??? They are working on it, and for now I'm stuck with Fedora 4 until they fix it. So it's not ready for testing yet. I'll let you know when I can test it in a full Linux environment.
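Going back to the replication behavior above, here is a toy sketch of why a replication factor of 2 left the Xeons with little data while 4 would have pushed full copies to them. This is not NDFS's actual placement code, and the node names are made up for illustration; it only models the chain order described in this message (local datanode first, then the fastest remote):

```python
# Toy model of the replication chain described above -- NOT NDFS's real
# placement algorithm, just an illustration of the observed behavior.

def place_block(nodes, replication):
    """Pick the first `replication` nodes in chain order.

    `nodes` is ordered the way the message describes the chain:
    the local datanode first, then the fastest remote, then the
    slower 100 Mbit machines.
    """
    return nodes[:replication]

chain = ["athlon64-local", "p4-gigabit", "xeon1-100mbit", "xeon2-100mbit"]

print(place_block(chain, 2))  # ['athlon64-local', 'p4-gigabit']
print(place_block(chain, 4))  # all four nodes would hold a copy
```

With replication=2, every block is satisfied by the local datanode plus the P4, so the Xeons only see the occasional chunk; with replication=4, every block would have to land on them too.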
wow that was a long one!!!
------- End of Forwarded Message -------

Pound Web Hosting
