nutch-dev mailing list archives

From "Jay Pound" <webmas...@poundwebhosting.com>
Subject NDFS benchmark results
Date Sat, 06 Aug 2005 22:30:22 GMT
OK, here it is:

I was seeing the same thing Doug was seeing when copying data in and out of
NDFS: 5 MB/sec I/O, which reminds me of the good old days when there were only
100 Mbit half-duplex connections. I'm running 3 machines at 1000 Mbit and 2 at
100 Mbit. Now here is where I'm able to see throughput in the 500-600 Mbit
range: while copying data to the DFS, if I shut down a node it will replicate
data at the same time as transferring data into the DFS, peaking around 53
MB/sec (I'm only working with a 1.8 GB file this time). The bad news: when
doing a get, it will use 100% of the CPU to pull down data at 100 Mbit on a
gigabit machine. Perhaps there is some code that could be cleaned up in
org.apache.nutch.fs.TestClient to make this faster, or it could open multiple
threads for receiving data to distribute the load across all CPUs in the
system. I was also able to see a performance increase per machine while
running multiple datanodes on each box; by this I mean more network throughput
per box. So Doug, if you run 4 datanodes per box and your 400 GB drives aren't
in a RAID setup, you will see higher throughput per box for datanode traffic.
Doug, I know you're already looking at the namenode to see how to speed things
up; may I request 2 things for NDFS that are going to be needed:
1.) Namenode: please thread out the different sections of the code. Make
replication its own thread, with put and get as separate threads as well; this
should speed things up when working in a large cluster, and maybe also lower
the time it takes to respond when putting chunks on the machines. It seems
like it queues the put requests for each datanode; maybe run get and put
requests in parallel instead of waiting for a response from the datanode being
requested? If I'm wrong on any of this, sorry: I'm not a programmer, and I
don't know how to read the Nutch code to see whether this is true or not;
otherwise I would know the answer to these.
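[Editor's sketch: the request above amounts to giving each kind of request its own worker thread so a slow put never blocks gets or replication. A minimal illustration follows; every class and method name here is invented, and this is not taken from the actual Nutch namenode code.]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: separate worker threads for put, get, and
// replication, so one kind of request never queues behind another.
// None of these names come from the Nutch source.
public class NameNodeWorkersSketch {
    private final ExecutorService putPool = Executors.newSingleThreadExecutor();
    private final ExecutorService getPool = Executors.newSingleThreadExecutor();
    private final ExecutorService replicationPool = Executors.newSingleThreadExecutor();

    Future<String> put(final String chunk) {
        return putPool.submit(new Callable<String>() {
            public String call() { return "stored " + chunk; }
        });
    }

    Future<String> get(final String chunk) {
        return getPool.submit(new Callable<String>() {
            public String call() { return "served " + chunk; }
        });
    }

    Future<String> replicate(final String chunk) {
        return replicationPool.submit(new Callable<String>() {
            public String call() { return "replicated " + chunk; }
        });
    }

    void shutdown() {
        putPool.shutdown();
        getPool.shutdown();
        replicationPool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        NameNodeWorkersSketch nn = new NameNodeWorkersSketch();
        // A put, a get, and a replication can all be in flight at once,
        // instead of waiting in a single serial queue.
        Future<String> p = nn.put("chunk-1");
        Future<String> g = nn.get("chunk-2");
        Future<String> r = nn.replicate("chunk-3");
        System.out.println(p.get());
        System.out.println(g.get());
        System.out.println(r.get());
        nn.shutdown();
    }
}
```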
2.) Datanode: please, please, please put the data into sub-directories the
way squid does. I really do not want a single directory with a million
files/chunks in it. ReiserFS will do OK with it, but I'm running multiple
terabytes per datanode in a single logical drive configuration, and I don't
want to run the filesystem to its limit, crash, and lose all my data because
the machine won't boot (I have experience in this area, unfortunately).
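[Editor's sketch: the squid-style layout being asked for is a two-level directory fan-out keyed off the block id, so no single directory ever accumulates millions of entries. The blk_<id> naming and the hash scheme below are assumptions for illustration, not NDFS's actual layout.]

```java
// Hypothetical sketch: squid-style two-level directory fan-out for chunk
// files. The naming scheme (blk_<id>) and the hash are assumptions, not
// how NDFS actually stores blocks.
public class ChunkDirsSketch {
    // Map a block id to root/xx/yy/blk_<id>, where xx and yy are the two
    // low bytes of the id in hex: 256 top-level dirs, 256 subdirs each,
    // 65536 buckets total.
    static String chunkPath(String root, long blockId) {
        int bucket = (int) (blockId & 0xFFFF);
        String d1 = String.format("%02x", bucket >> 8);
        String d2 = String.format("%02x", bucket & 0xFF);
        return root + "/" + d1 + "/" + d2 + "/blk_" + blockId;
    }

    public static void main(String[] args) {
        // 123456789 = 0x075BCD15, so the two low bytes are cd and 15.
        System.out.println(chunkPath("/data/ndfs", 123456789L));
    }
}
```

With a million chunks this keeps each bucket at roughly 15 files instead of one directory holding them all.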
3.) Excellent job on making it much more stable; it looks very close to
usable now!
-Jay Pound
PS: Doug, I would like to talk with you sometime about this if you have an
opportunity.
PPS: here is a snippet of the -report, just if you're interested:




Administrator@desk /nutch-ndfs
$ ./bin/nutch org.apache.nutch.fs.TestClient -report
050806 182802 parsing file:/C:/cygwin/nutch-ndfs/conf/nutch-default.xml
050806 182802 parsing file:/C:/cygwin/nutch-ndfs/conf/nutch-site.xml
050806 182803 No FS indicated, using default:OPTERON:9000
050806 182803 Client connection to 10.0.0.101:9000: starting
Total raw bytes: 2449488044032 (2281.26 Gb)
Used raw bytes: 1194891779358 (1112.82 Gb)
% used: 48.78%

Total effective bytes: 4014804878 (3.73 Gb)
Effective replication multiplier: 297.6213827739601
-------------------------------------------------
Datanodes available: 10

Name: CPQ19312594631:7000
Total raw bytes: 39999500288 (37.25 Gb)
Used raw bytes: 14021105091 (13.05 Gb)
% used: 35.05%
Last contact with namenode: Sat Aug 06 18:29:01 EDT 2005

Name: desk:7000
Total raw bytes: 74027487232 (68.94 Gb)
Used raw bytes: 58792909619 (54.75 Gb)
% used: 79.42%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: desk:7001
Total raw bytes: 320070287360 (298.08 Gb)
Used raw bytes: 287845425725 (268.07 Gb)
% used: 89.93%
Last contact with namenode: Sat Aug 06 18:29:01 EDT 2005

Name: desk:7002
Total raw bytes: 250048479232 (232.87 Gb)
Used raw bytes: 248354007613 (231.29 Gb)
% used: 99.32%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: desk:7003
Total raw bytes: 200047001600 (186.30 Gb)
Used raw bytes: 196543472435 (183.04 Gb)
% used: 98.24%
Last contact with namenode: Sat Aug 06 18:29:00 EDT 2005

Name: desk:7004
Total raw bytes: 200047001600 (186.30 Gb)
Used raw bytes: 190989330432 (177.87 Gb)
% used: 95.47%
Last contact with namenode: Sat Aug 06 18:28:59 EDT 2005

Name: desk:7005
Total raw bytes: 200038776832 (186.30 Gb)
Used raw bytes: 81084996239 (75.51 Gb)
% used: 40.53%
Last contact with namenode: Sat Aug 06 18:28:59 EDT 2005

Name: michael-05699cn:7000
Total raw bytes: 160031014912 (149.04 Gb)
Used raw bytes: 46235792507 (43.06 Gb)
% used: 28.89%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: opteron:7000
Total raw bytes: 959914815488 (893.99 Gb)
Used raw bytes: 29605043569 (27.57 Gb)
% used: 3.08%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: quadzilla:7000
Total raw bytes: 45263679488 (42.15 Gb)
Used raw bytes: 41419696128 (38.57 Gb)
% used: 91.50%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

I love cluster filesystems!!! How cool is that?
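[Editor's note: the "Effective replication multiplier" in the report above appears to be just used raw bytes divided by total effective bytes; a quick sanity check with the report's own numbers reproduces it.]

```java
// Sanity-check the report's "Effective replication multiplier" using the
// figures printed above: used raw bytes / total effective bytes.
public class ReplicationMultiplier {
    public static void main(String[] args) {
        long usedRaw = 1194891779358L;   // "Used raw bytes" from the report
        long effective = 4014804878L;    // "Total effective bytes"
        double multiplier = (double) usedRaw / effective;
        // Comes out to ~297.62, matching the reported 297.6213827739601.
        System.out.println(multiplier);
    }
}
```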


