nutch-dev mailing list archives

From "Jay Pound" <>
Subject Re: Memory usage2
Date Tue, 02 Aug 2005 19:43:59 GMT
What's the bottleneck for the slow searching? I'm monitoring it, and it runs at
about 57% CPU load while I'm searching. It takes about 50 seconds to bring up
the results page the first time; if I search for the same thing again,
it's much faster.
Doug, can I trash my segments after they are indexed? I don't need
cached access to the pages, so do the segments still need to be there? My 30 million
page index/segment set is using over 300GB. I have the space, but when I get to
the hundreds of millions of pages I will run out of room on my RAID
controllers for HD expansion. I'm planning on moving to Lustre if NDFS is
not stable by then. I plan on having a multi-billion page index if the
memory requirements for that can be kept below 16GB per search node. Right now
I'm getting pretty crappy results from my 30 million pages. I read the
whitepaper "Authoritative Sources in a Hyperlinked Environment" because
someone said that's how the Nutch algorithm works, so I'm assuming that as my
index grows the pages that deserve top placement will receive top placement.
But I don't know if I should re-fetch a new set of segments with root URLs
just ending in US extensions( etc...). I made a small set to test
this theory (100,000 pages) and its results were much better than my results
from the 30 million page index. What's your thought on this? Am I right in
thinking that the pages with the most pages linking to them will show up
first? So if I index 500 million pages, should my results be on par with the
rest of the "big dogs"?
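
For reference, the hub/authority iteration from the paper mentioned above can be sketched on a toy link graph. This is only an illustration of that paper's idea; the thread does not confirm that Nutch 0.6 scores pages this way, and the function name `hits` and the sample graph are made up for the example. Note that a page's authority depends on the hub scores of the pages linking to it, not on a raw in-link count.

```python
import math

def hits(links, iterations=20):
    """Toy hub/authority iteration. links maps a page to the pages it links to.
    Illustrative only; not Nutch's scoring code."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub score of a page: sum of the authority of the pages it links to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical graph: "c" is linked from three pages, "a" from one.
sample_links = {"a": ["c"], "b": ["c"], "d": ["c", "a"]}
hub_scores, auth_scores = hits(sample_links)
top_authority = max(auth_scores, key=auth_scores.get)
print(top_authority)
```

In this toy graph the most-linked page also ends up with the highest authority, but on a real crawl the ranking also depends on which pages are doing the linking.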

One last important question: if I merge my indexes, will it search faster
than if I don't merge them? I currently have 20 directories of 1-1.7 million
pages each.
And if I split these indexes across multiple machines, will the searching
be faster? I couldn't get the nutch-server to work, but I'm using 0.6.
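
As a rough picture of what splitting an index across machines involves, the sketch below has each shard return its own top hits and then merges them into one ranked list. It is not Nutch's API: `search_shard`, `distributed_search` and the in-memory shard layout are hypothetical, the shards are queried one after another rather than in parallel, and it assumes scores from different shards are directly comparable.

```python
import heapq

def search_shard(shard, query, k):
    """Hypothetical per-machine search: return the shard's top-k (score, doc) pairs.
    Here a shard is just a dict mapping query -> {doc: score}."""
    hits = [(score, doc) for doc, score in shard.get(query, {}).items()]
    return heapq.nlargest(k, hits)

def distributed_search(shards, query, k):
    """Ask every shard for its top-k hits, then keep the overall top-k."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, query, k))
    return heapq.nlargest(k, partial)

# Two made-up shards holding different subsets of the collection.
shards = [
    {"nutch": {"a": 0.9, "b": 0.2}},
    {"nutch": {"c": 0.5}},
]
print(distributed_search(shards, "nutch", 2))
```

Each shard only has to hold and search its own subset, which is why both memory use and per-query work per machine go down as shards are added.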

I have a very fast server, but I didn't know if the searching would take advantage
of SMP; fetching will, and I can run multiple indexing runs at the same time. My HD
array does 200MB/sec I/O. I have the new dual-core Opteron 275 (Italy core)
with 4GB RAM, working my way to 16GB when I need it and a second processor
when I need it, and 1.28TB of HD space for Nutch currently, with expansion up to
5.12TB. I'm currently running Windows 2000 on it, as they haven't made a
driver yet for SUSE 9.3 for my RAID cards (HighPoint 2220), so my scalability
will be to 960MB/sec with all the drives in the system and 4x2.2GHz
processor cores. Until I need to cluster, that's what I have to play with,
in case you guys needed to know what hardware I'm running.
Thank you
-Jay Pound
BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch is OK,
but I can't do too many things at once or I'll get a kernel inpage error (guess
it's time to migrate to server-damn).
----- Original Message ----- 
From: "Doug Cutting" <>
To: <>
Sent: Tuesday, August 02, 2005 1:53 PM
Subject: Re: Memory usage

> Try the following settings in your nutch-site.xml:
> <property>
>    <name></name>
>    <value>7</value>
> </property>
> <property>
>    <name>indexer.termIndexInterval</name>
>    <value>1024</value>
> </property>
> The first causes data files to use considerably less memory.
> The second affects index creation, so must be done before you create the
> index you search.  It's okay if your segment indexes were created
> without this, you can just (re-)merge indexes and the merged index will
> get the setting and use less memory when searching.
> Combining these two I have searched a 40+M page index on a machine using
> about 500MB of RAM.  That said, search times with such a large index are
> not good.  At some point, as your collection grows, you will want to
> merge multiple indexes containing different subsets of segments and put
> each on a separate box and search them with distributed search.
> Doug
> Jay Pound wrote:
> > I'm testing an index of 30 million pages; it requires 1.5GB of RAM to search
> > using Tomcat 5. I plan on having an index with multiple billion pages, but
> > if this is to scale then even with 16GB of RAM I won't be able to have an
> > index larger than 320 million pages? How can I distribute the memory
> > requirements across multiple machines, or is there another servlet container
> > (like Resin) that will require less memory to operate? Has anyone else run
> > into this?
> > Thanks,
> > -Jay Pound
> >
> >
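
The 320 million figure in the original question is a straight linear extrapolation from 30 million pages in 1.5GB. The sketch below repeats that arithmetic and applies the same extrapolation to the 40 million pages in about 500MB reported in the reply. It assumes memory grows linearly with page count and treats 500MB as 0.5GB, neither of which the thread guarantees; `max_pages_millions` is a name invented for the example.

```python
def max_pages_millions(ram_gb, sample_pages_millions, sample_ram_gb):
    """Linear extrapolation (an assumption): how many million pages fit in ram_gb
    if sample_pages_millions pages needed sample_ram_gb of RAM."""
    return ram_gb * sample_pages_millions / sample_ram_gb

# Figures from the thread: 30M pages in 1.5GB, and 40M pages in about 0.5GB
# with the suggested settings.
before = max_pages_millions(16, 30, 1.5)
after = max_pages_millions(16, 40, 0.5)
print(f"default settings:   about {before:.0f} million pages in 16GB")
print(f"suggested settings: about {after:.0f} million pages in 16GB")
```

Even under the more favourable figure, a multi-billion page collection would not fit on one 16GB node, which matches the advice to split it across machines with distributed search.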
