nutch-dev mailing list archives

From "Jay Pound" <>
Subject Re: near-term plan
Date Thu, 04 Aug 2005 20:16:20 GMT
Doug, I also ran into this when I was testing NDFS: the system had to wait
for the namenode to tell the datanodes what data to receive and which data
to replicate. I'm currently setting up Lustre to see how it works; it
operates at the kernel level. Do you think the namenode would perform
better if it were not written in Java? I plan on running a system where the
namenode (metadata) server will have to perform thousands of I/Os a second:
concurrently updating the indexes of multiple segments, updating the db on
one machine, and fetching multiple segments on multiple machines, all
accessing the same logical filesystem at the same time. The way the
namenode responded, it took a few seconds to replicate data to other
datanodes, and it took time to start copying data. If you are writing an
index and have to wait 1-10 secs per file to be written (if queued), that
will cause serious problems. Also, I was able to saturate gigabit with NDFS
(well, about 50-60 MBytes a sec; it's hard to get better than that with
copper). It just took a few secs to "ramp up" to speed, and that's
including file copying and replication.
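
As a rough back-of-the-envelope sketch of the per-file wait concern above: if
every file waits some seconds before it starts streaming at link speed, small
files see far less than the link rate. The function name, the file sizes, and
the 55 MB/s midpoint are assumptions chosen for illustration; only the 1-10
sec wait and the 50-60 MBytes/sec range come from the message.

```python
def effective_mb_per_sec(file_mb, wait_secs, link_mb_per_sec):
    """Average rate when each file waits wait_secs, then streams at link speed."""
    transfer_secs = file_mb / link_mb_per_sec
    return file_mb / (wait_secs + transfer_secs)


if __name__ == "__main__":
    link = 55.0  # assumed midpoint of the 50-60 MBytes/sec reported above
    for file_mb in (1, 32, 1024):      # hypothetical file sizes
        for wait_secs in (1, 10):      # the 1-10 sec range from the message
            rate = effective_mb_per_sec(file_mb, wait_secs, link)
            print(f"{file_mb:5d} MB file, {wait_secs:2d} s wait: {rate:6.2f} MB/s")
```

Under this simple model the wait dominates for small files and is amortized
for large ones, which is why many small index files would suffer most.
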
PS: Where can I find out about MapReduce? I read the presentations, but
I don't get the core concept of it.

PPS: VIA chips aren't very FPU-powerful; try an Opteron for your namenode. I
bet you will see a huge improvement in speed, even over Xeons, P4s, etc. I
was only able to test 5 machines, but I was able to saturate 50-60 MB a sec
to each (mainly replication throughput running level 1).

----- Original Message ----- 
From: "Doug Cutting" <>
To: <>
Sent: Thursday, August 04, 2005 3:54 PM
Subject: Re: near-term plan

> Stefan Groschupf wrote:
> >>
> >
> > Can you explain what this means: Page 20:
> > - scheduling is bottleneck, not disk, network or CPU?
> I mean that none of the CPUs, disks or network is at 100% of capacity.
> Disks are running around 50% busy, CPUs a bit higher, and the network
> switch has lots of bandwidth left.  (Although, if we used multiple racks
> connected with gigabit links, these inter-rack links would already be
> near capacity.)  So sometimes the CPU is busy generating random data and
> stuffing it in a buffer, and sometimes the disk is busy writing data,
> but we're not keeping both busy at the same time all the time.  Perhaps
> more threads/processes and/or bigger buffers would increase the
> utilization--I have not tried to tune things for this benchmark.  But I
> am not disappointed with this performance.  Rather, I think that it is
> fast enough so that with real applications, with non-trivial map and
> reduce functions, NDFS will not be a bottleneck.
> Doug
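
The idea of keeping the CPU and the disk busy at the same time can be
sketched as one thread generating random data while a second thread writes
it out, with a bounded buffer between them. This is only an illustration
under stated assumptions, not how the NDFS benchmark is written: the function
name `write_overlapped`, the chunk sizes, and the buffer depth are all
hypothetical, and there is no error handling.

```python
import os
import queue
import tempfile
import threading


def write_overlapped(path, num_chunks, chunk_bytes, max_buffered=8):
    """Generate random chunks on this thread while another thread writes them."""
    chunks = queue.Queue(maxsize=max_buffered)  # bounded buffer between stages
    written = 0

    with open(path, "wb") as out:

        def writer():
            nonlocal written
            while True:
                chunk = chunks.get()
                if chunk is None:  # sentinel: the generator is finished
                    break
                out.write(chunk)
                written += len(chunk)

        thread = threading.Thread(target=writer)
        thread.start()
        try:
            for _ in range(num_chunks):
                # put() blocks once max_buffered chunks are waiting to be written
                chunks.put(os.urandom(chunk_bytes))
        finally:
            chunks.put(None)
            thread.join()
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "random.dat")
        total = write_overlapped(target, num_chunks=64, chunk_bytes=1 << 20)
        print(f"wrote {total} bytes")
```

A larger `max_buffered` corresponds to the "bigger buffers" mentioned above;
whether it actually raises utilization would have to be measured.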
