nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Date Tue, 13 Dec 2005 14:43:00 GMT
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>> Shouldn't this be combined with a HitCollector that collects only the 
>> first-n matches? Otherwise we still need to scan the whole posting 
>> list...
> Yes.  I was just posting the work-in-progress.

Ok, I have only tested IndexSorter so far. It appears to work correctly: 
I get exactly the same results, with the same scores and the same 
explanations, when I run the same queries on the original index and on 
the sorted index. For now, the query response time is identical as far 
as I can tell.

> We will also need to estimate the total number of matches by 
> extrapolating linearly from the maximum doc id processed.

...which should be reported by the custom HitCollector, right?
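
To make the idea concrete, here is a rough sketch of what I have in 
mind. It is not Lucene or Nutch code: FirstNCollector, LimitReached and 
scan() are hypothetical names, and only the collect(int doc, float 
score) shape is loosely modeled on HitCollector. It assumes postings 
arrive in ascending doc id order and that matches are spread roughly 
evenly over the doc id range, which may not hold in practice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene or Nutch code.
class FirstNCollector {
    /** Thrown from collect() to abort the scan once the limit is reached. */
    static final class LimitReached extends RuntimeException {
        LimitReached() { super("hit limit reached"); }
    }

    private final int limit;   // stop after this many matches
    private final int maxDoc;  // total number of documents in the index
    private final List<Integer> docs = new ArrayList<>();
    private int lastDoc = -1;  // highest doc id processed so far
    private boolean truncated = false;

    FirstNCollector(int limit, int maxDoc) {
        if (limit < 1 || maxDoc < 1) {
            throw new IllegalArgumentException("limit and maxDoc must be >= 1");
        }
        this.limit = limit;
        this.maxDoc = maxDoc;
    }

    /** Records one match; aborts the scan after the limit-th match. */
    void collect(int doc, float score) {
        docs.add(doc);
        lastDoc = doc;
        if (docs.size() >= limit) {
            truncated = true;
            throw new LimitReached();
        }
    }

    /**
     * Exact count if the scan ran to completion; otherwise a linear
     * extrapolation from the fraction of the doc id range processed.
     */
    long estimateTotal() {
        if (!truncated) {
            return docs.size();
        }
        return Math.round((double) docs.size() * maxDoc / (lastDoc + 1));
    }

    List<Integer> docs() {
        return docs;
    }

    /** Stand-in for the scorer loop: feeds ascending doc ids to the collector. */
    static void scan(int[] postings, FirstNCollector collector) {
        try {
            for (int doc : postings) {
                collector.collect(doc, 1.0f);
            }
        } catch (LimitReached e) {
            // Expected: enough hits collected, the rest of the list is skipped.
        }
    }
}
```

The caller would read docs() for the hits and estimateTotal() for the 
reported total. Aborting with an exception is just one way to stop the 
scan early from inside a callback; a real implementation might do it 
differently.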

> Finally, it is probably rather slow for large indexes, whose .fdt 
> won't fit in memory.  A simple way to improve that might be to use 
> Similarity.floatToByte-encoded floats when sorting (e.g., the norm 
> from an untokenized field) so that 

Yes. For an index of 5 million docs, the IndexOptimizer takes ~10 min. 
to complete, whereas this IndexSorter took over 1 hour...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
