nutch-dev mailing list archives

From Byron Miller <byronmh...@yahoo.com>
Subject Re: IndexSorter optimizer
Date Wed, 21 Dec 2005 23:55:09 GMT
I've got a 400 million db I can run this against over the
next few days.

-byron

--- Stefan Groschupf <sg@media-style.com> wrote:

> Hi Andrzej,
> 
> Wow, this is really great news!
> > Using the optimized index, I reported previously that some of the
> > top-scoring results were missing. As it happens, the missing
> > results were typically the "junk" pages with high tf/idf but low
> > "boost". Since we collect up to N hits, going from higher to lower
> > "boost" values, the "junk" pages with a low "boost" value were
> > automatically eliminated. So, overall the subjective quality of
> > results was improved. On the other hand, some of the legitimate
> > results with decent "boost" values were also skipped because they
> > didn't fit within the fixed number of hits... ah, well. Perhaps we
> > should limit the number of hits in LimitedCollector using a cutoff
> > "boost" value, and not the maximum number of hits (or maybe both?).
> 
> As long as we are experimenting, it would be good to have both.
> 
> > To conclude, I will add the IndexSorter.java to the core classes,
> > and I suggest we continue the experiments ...
> 
> Maybe someone out there in the community has a commercial search
> engine running (e.g. a Google appliance or similar), so we could set
> up a Nutch with the same pages and compare the results.
> I guess it will be difficult to compare Nutch with Yahoo or Google,
> since none of us has a 4 billion index up and running. I would run
> one on my laptop, but I do not have the bandwidth to fetch it within
> the next two days. :-D
> Great work!
> 
> Cheers,
> Stefan 
> 
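
The idea discussed in the quoted message, stopping collection at a cutoff "boost" value as well as at a maximum number of hits, could be sketched roughly as below. This is only an illustration under stated assumptions: it does not use the Lucene or Nutch APIs and is not the actual LimitedCollector. The class name BoostCutoffSketch, the Hit type, and the parameters maxHits and minBoost are all hypothetical. It assumes the candidates arrive already ordered from higher to lower "boost", as they would from a sorted index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; NOT Nutch's LimitedCollector. All names are illustrative.
final class BoostCutoffSketch {

    /** A candidate hit: document id, index-time boost, and query score. */
    static final class Hit {
        final int doc;
        final float boost;
        final float score;

        Hit(int doc, float boost, float score) {
            this.doc = doc;
            this.boost = boost;
            this.score = score;
        }
    }

    /**
     * Collects hits from a list that is assumed to be ordered by descending boost.
     * Collection stops when maxHits hits have been gathered, or at the first hit
     * whose boost is below minBoost, whichever comes first.
     */
    static List<Hit> collect(List<Hit> sortedByBoostDesc, int maxHits, float minBoost) {
        List<Hit> collected = new ArrayList<>();
        for (Hit hit : sortedByBoostDesc) {
            if (collected.size() >= maxHits) {
                break; // fixed limit on the number of hits reached
            }
            if (hit.boost < minBoost) {
                break; // input is sorted, so every later hit has lower boost too
            }
            collected.add(hit);
        }
        return collected;
    }
}
```

Using both limits means a low-boost page with a high tf/idf score is dropped by the cutoff even when fewer than maxHits hits were found, while the fixed count still bounds the work done per query.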

