I've got a 400-million db I can run this against over the
next few days.
-byron
--- Stefan Groschupf <sg@media-style.com> wrote:
> Hi Andrzej,
>
> wow, this is really great news!
> > Using the optimized index, I reported previously that some of the
> > top-scoring results were missing. As it happens, the missing
> > results were typically the "junk" pages with high tf/idf but low
> > "boost". Since we collect up to N hits, going from higher to lower
> > "boost" values, the "junk" pages with low "boost" value were
> > automatically eliminated. So, overall the subjective quality of
> > results was improved. On the other hand, some of the legitimate
> > results with decent "boost" values were also skipped because they
> > didn't fit within the fixed number of hits... ah, well. Perhaps we
> > should limit the number of hits in LimitedCollector using a cutoff
> > "boost" value, and not the maximum number of hits (or maybe both?).
>
> As far as our experiments go, it would be good to have both.
>
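To make the "cutoff value, maximum count, or both" idea concrete, here is a minimal sketch in plain Java. It is not the actual LimitedCollector code; the class, the `Hit` type, and the parameter names `maxHits` and `minBoost` are all hypothetical. It assumes hits arrive in descending "boost" order, as they would from the sorted index, so collection can stop at the first hit that fails either limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Nutch LimitedCollector implementation.
class BoostCutoffSketch {

    // Hypothetical hit: document id, static "boost", and query-time score.
    static final class Hit {
        final int doc;
        final float boost;
        final float score;

        Hit(int doc, float boost, float score) {
            this.doc = doc;
            this.boost = boost;
            this.score = score;
        }
    }

    // Collects hits from a list assumed to be ordered by descending boost.
    // Stops when maxHits have been collected or when the first hit with
    // boost below minBoost is seen, whichever comes first.
    static List<Hit> collect(List<Hit> sortedByBoostDesc, int maxHits, float minBoost) {
        List<Hit> collected = new ArrayList<>();
        for (Hit hit : sortedByBoostDesc) {
            if (collected.size() >= maxHits) {
                break; // count limit reached
            }
            if (hit.boost < minBoost) {
                break; // every later hit has an even lower boost
            }
            collected.add(hit);
        }
        return collected;
    }
}
```

Passing a very large `maxHits` gives the pure cutoff behaviour, and passing a `minBoost` of 0 gives the pure count limit, so both variants can be compared with the same code.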
> > To conclude, I will add the IndexSorter.java to the core classes,
> > and I suggest to continue the experiments ...
>
> Maybe someone out there in the community has a commercial search
> engine running (e.g. a Google appliance or similar), so we could set
> up a Nutch with the same pages and compare the results.
> I guess it will be difficult to compare Nutch with Yahoo or Google,
> since none of us has a 4 billion index up and running. I would run
> one on my laptop, but I do not have the bandwidth to fetch that
> within the next two days. :-D
> Great work!
>
> Cheers,
> Stefan
>