nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Mon, 12 Dec 2005 17:50:57 GMT

Doug Cutting wrote:

> Yes, this is why I was discouraged and stopped working on this.
>
> However I am now hopeful that sorting the entire index by page score 
> and using top-1000 might work well with Nutch queries, since page 
> score is field-independent, and I think fields cause the problems.  
> Plus, this would be a lot simpler than the cross-field summing 
> described above.
>
> I can start writing an index-sorter today, unless you are already 
> working on this.  If you have an evaluation framework, that would be 
> great.


By all means, please start; this is still near the limits of my knowledge 
of Lucene... ;-)

My testing framework consists of a bunch of Beanshell scripts, and a 
test index that I know of (which I'm not at liberty to share). But I can 
prepare another index, based e.g. on the Reuters corpus, and clean up 
the scripts somewhat.

I'm interested in following this up and contributing to a usable conclusion.
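
For reference, the index-sorting idea quoted above might be sketched roughly 
as follows. This is only a toy illustration, not Lucene or Nutch code: the 
in-memory postings structure and the names build_sorted_index and 
search_top_n are hypothetical. It assumes each document carries one 
field-independent page score, renumbers documents so that the highest score 
gets ID 0, and stops scanning once n matches (top-1000 in the proposal) are 
found.

```python
from collections import defaultdict


def build_sorted_index(docs):
    """Build a toy inverted index sorted by page score (hypothetical helper).

    docs is a list of (page_score, text) pairs. Documents are renumbered so
    that new ID 0 is the highest-scoring page; `order` maps each new ID back
    to its original position in docs.
    """
    order = sorted(range(len(docs)), key=lambda i: -docs[i][0])
    postings = defaultdict(list)
    for new_id, old_id in enumerate(order):
        for term in set(docs[old_id][1].lower().split()):
            # new_id only grows, so each posting list stays in score order
            postings[term].append(new_id)
    return postings, order


def search_top_n(postings, order, terms, n):
    """AND-query that stops after the first n matches (hypothetical helper).

    Because posting lists are in descending page-score order, the first n
    matches are the n highest-scoring matching pages.
    """
    lists = [postings.get(t, []) for t in terms]
    if not lists or any(not lst for lst in lists):
        return []
    shortest = min(lists, key=len)
    others = [set(lst) for lst in lists if lst is not shortest]
    hits = []
    for doc_id in shortest:
        if all(doc_id in s for s in others):
            hits.append(order[doc_id])
            if len(hits) == n:
                break  # early termination: the rest of the index is skipped
    return hits


docs = [
    (0.1, "lucene index"),
    (0.9, "lucene search index"),
    (0.5, "nutch lucene index"),
    (0.7, "nutch crawl"),
]
postings, order = build_sorted_index(docs)
print(search_top_n(postings, order, ["lucene", "index"], 2))
```

A real implementation would presumably still apply the usual query scoring 
to the hits it collects; the sketch only shows why a score-sorted index 
lets the scan stop early.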

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


