nutch-dev mailing list archives

From Fredrik Andersson <fidde.anders...@gmail.com>
Subject Re: Memory usage2
Date Tue, 02 Aug 2005 21:08:52 GMT
Hi Jay!

Why not use the "Google approach" and buy lots of cheap
workstations/servers to distribute the search across? You can get away
really cheap these days compared to high-end servers. Even if NDFS
isn't fully up to par in 0.7-dev yet, you can still move your indices
to separate computers and distribute them that way. Writing a small
client/server for this purpose can be done in a matter of hours.
Gathering as much data as you have on one server sounds like a bad
idea to me, no matter how monstrous that server is.
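If you'd rather not roll your own, the 0.7-dev tree already has a
distributed search server/client pair you can try. From memory (so
double-check the script and class names against your checkout, and note
the hosts and paths below are just placeholders): start a search server
on every box that holds a slice of the index,

  bin/nutch server 9999 /data/index-slice-1

then point the web front end at them by setting searcher.dir in
nutch-site.xml to a directory containing a search-servers.txt file with
one "host port" line per server:

  search01 9999
  search02 9999

The servlet fans each query out to all the servers and merges the hits,
so no single box needs to hold the whole index in memory.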

Regarding the HITS algorithm - check out the example on the Nutch
website for the Internet crawl, where you select the top scorers after
you finish a segment (of arbitrary size) and continue crawling from
those high-ranking sites. That way you get the most authoritative
sites into your index first, which is good.
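In 0.6/0.7 terms that boils down to passing -topN when you generate
each fetchlist, something like (the db/segments paths here are just
placeholders, see the whole-web tutorial for the exact arguments):

  bin/nutch generate db segments -topN 1000

so every round only fetches the 1000 highest-scoring unfetched pages,
and the authoritative sites get pulled in first.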

Good night,
Fredrik

On 8/2/05, Jay Pound <webmaster@poundwebhosting.com> wrote:
> ....
> one last important question: if I merge my indexes, will it search faster
> than if I don't merge them? I currently have 20 directories of 1-1.7 million
> pages each.
> And if I split these indexes across multiple machines, will the searching
> be faster? I couldn't get the nutch-server to work, but I'm using 0.6.
> ...
> Thank you
> -Jay Pound
> Fromped.com
> BTW Windows 2000 is not 100% stable with dual-core processors. Nutch is OK
> but can't do too many things at once or I'll get a kernel inpage error (guess
> it's time to migrate to 2003 .NET server - damn)
> ----- Original Message -----
> From: "Doug Cutting" <cutting@nutch.org>
> To: <nutch-dev@lucene.apache.org>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
> 
> 
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >    <name>io.map.index.skip</name>
> >    <value>7</value>
> > </property>
> >
> > <property>
> >    <name>indexer.termIndexInterval</name>
> >    <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so it must be done before you create the
> > index you search.  It's okay if your segment indexes were created
> > without this; you can just (re-)merge indexes and the merged index will
> > get the setting and use less memory when searching.
> >
> > Combining these two I have searched a 40+M page index on a machine using
> > about 500MB of RAM.  That said, search times with such a large index are
> > not good.  At some point, as your collection grows, you will want to
> > merge multiple indexes containing different subsets of segments and put
> > each on a separate box and search them with distributed search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages; it requires 1.5 GB of RAM to
> > > search using Tomcat 5. I plan on having an index with multiple billion
> > > pages, but if this scales linearly, then even with 16 GB of RAM I won't
> > > be able to have an index larger than 320 million pages? How can I
> > > distribute the memory requirements across multiple machines? Or is there
> > > another servlet container (like Resin) that will require less memory to
> > > operate? Has anyone else run into this?
> > > Thanks,
> > > -Jay Pound
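PS: on your merge question - merging won't magically speed searches up
by itself, but Doug's indexer.termIndexInterval setting only takes
effect when an index is (re)written, and the index merger does exactly
that. If I remember right the invocation is something like

  bin/nutch merge merged-index indexes/

(check bin/nutch for the exact alias and arguments; the class behind it
is org.apache.nutch.indexer.IndexMerger). One big index on one box
mostly saves you file handles and a bit of per-index overhead; past a
certain size you'll win more by splitting across machines as above.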
