nutch-dev mailing list archives

From "EM" <emili...@cpuedge.com>
Subject RE: Memory usage2
Date Tue, 02 Aug 2005 22:26:15 GMT
Why isn't 'analyze' supported anymore?

-----Original Message-----
From: Andy Liu [mailto:andyliu1227@gmail.com] 
Sent: Tuesday, August 02, 2005 5:44 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Memory usage2

I have found that merging indexes does help performance significantly.

If you're not using the cached pages for anything, I believe you can
delete the /content directory for each segment and the engine should
still work fine (test before you try it for real!).  However, if you
ever have to reindex the segments for whatever reason, you'll run into
problems without the /content dirs.
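
A minimal sketch of that cleanup, assuming every segment is a
subdirectory of one segments root and holds its cached pages in a
subdirectory named "content". The function names and the dry-run flag
are hypothetical and not part of Nutch. It defaults to a dry run, in
the spirit of testing first:

```python
from pathlib import Path
import shutil


def find_content_dirs(segments_root):
    """Return the 'content' directory of every segment under segments_root.

    Assumes a layout of <segments_root>/<segment>/content; adjust if
    your segments are laid out differently.
    """
    root = Path(segments_root)
    return sorted(
        seg / "content"
        for seg in root.iterdir()
        if seg.is_dir() and (seg / "content").is_dir()
    )


def remove_content_dirs(segments_root, dry_run=True):
    """Delete each segment's content directory, or only list them if dry_run."""
    targets = find_content_dirs(segments_root)
    for target in targets:
        if dry_run:
            print(f"would remove {target}")
        else:
            # Irreversible: reindexing this segment later will not work.
            shutil.rmtree(target)
    return targets
```

Run it with the default dry_run=True first and check the printed list
against a backup before passing dry_run=False.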

Nutch doesn't use the HITS algorithm.  Nutch's analyze phase was based
on PageRank, but it's no longer supported.  By default Nutch boosts
documents based on the number of incoming links, which works well in
small document collections, but is not a robust method in a whole-web
environment.  In terms of search quality, Nutch would not be able to
hang with the "big dogs" of search just yet.  There's still much work
that needs to be done in the area of search quality and spamming.
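
To illustrate why boosting by raw inlink count is fragile, here is a
rough sketch. The log(e + n) formula and both function names are
assumptions chosen for illustration, not necessarily what Nutch
computes:

```python
import math


def link_count_boost(inlink_count):
    """Illustrative boost: 1.0 with no inlinks, growing slowly with more.

    The log(e + n) shape is an assumption, not Nutch's confirmed formula.
    """
    return math.log(math.e + inlink_count)


def boosted_score(text_score, inlink_count):
    """Combine a text-match score with the link-count boost."""
    return text_score * link_count_boost(inlink_count)


# A relevant page with a handful of real inlinks...
honest = boosted_score(text_score=0.9, inlink_count=10)
# ...versus a weaker match propped up by a link farm.
spammy = boosted_score(text_score=0.5, inlink_count=10000)

# Every inlink counts the same regardless of who links, so the farm wins.
print(spammy > honest)
```

Because the count ignores the quality of the linking pages, anyone who
can generate links cheaply can raise a page's boost.  That is less of a
concern in a small, trusted collection than on the open web.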

Andy

