nutch-dev mailing list archives

From "Howie Wang" <howie_w...@hotmail.com>
Subject RE: Strange search results
Date Wed, 03 Aug 2005 16:25:18 GMT
Thanks for the tips, Andy and Chirag! It saves me a lot of trouble.
I'll tweak the boosting for anchors and re-index and see where it
gets me.

Thanks,
Howie


>Concur with Andy on both points. Unfortunately, both values are written
>into the index, so there is no way to "go back" and remove either of them
>without reindexing. Let me save you the trouble if you were thinking of
>changing the similarity class as a workaround.
>
>IMO, the problem with anchors is that you either need to get them all, or
>not get them at all. Getting just a few anchors can give you really bad
>results, because anchor text like "click here" will give a high score to
>pages that contain neither of those terms.  Another approach is to go in
>the properties file and change the boost of anchors to 0.05, thus giving
>them a very low boost.
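
To illustrate the anchor-boost suggestion above, here is a minimal sketch of how a per-field boost scales each field's contribution to a page's score. It is not Lucene's actual scoring formula (which has more factors); the function name, the field names, and all the numbers are hypothetical and chosen only for illustration.

```python
def combined_score(field_scores, field_boosts):
    """Sum per-field match scores, each scaled by that field's boost.

    Simplified illustration only; real Lucene scoring has more factors.
    """
    return sum(field_boosts.get(field, 1.0) * score
               for field, score in field_scores.items())


# Hypothetical match scores for one query (not measured data).
anchor_only_page = {"anchor": 1.0}             # matches only via inbound anchor text
content_page = {"content": 0.6, "title": 0.3}  # matches in its own text

# Hypothetical boosts, not Nutch's shipped defaults.
high_anchor_boosts = {"anchor": 2.0, "title": 1.5, "content": 1.0}
low_anchor_boosts = {**high_anchor_boosts, "anchor": 0.05}

for label, boosts in (("high", high_anchor_boosts), ("low", low_anchor_boosts)):
    print(label,
          combined_score(anchor_only_page, boosts),
          combined_score(content_page, boosts))
```

With the high anchor boost the anchor-only page outranks the page that actually contains the terms; with the boost at 0.05 the order flips, which is the effect described above.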
>
>Regarding the norm: this is computed at index time for each field. We've
>changed the indexing code so that it's always 1.
>
>HTH,
>CC
>
>
>-----Original Message-----
>From: Andy Liu [mailto:andyliu1227@gmail.com]
>Sent: Wednesday, August 03, 2005 8:00 AM
>To: nutch-dev@lucene.apache.org
>Subject: Re: Strange search results
>
>The fieldNorm is lengthNorm * document boost.  The final value is "rounded"
>so that's why you're getting such clean numbers for your fieldNorm.  If
>you're finding that these pages have too high a boost, you can lower
>indexer.score.power in your conf file.
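
The arithmetic Andy describes can be sketched as follows. This is a simplified illustration under stated assumptions, not Lucene's or Nutch's actual code: it assumes the document boost grows as page score raised to indexer.score.power, and it assumes the "rounding" keeps only about three mantissa bits of the stored norm. The function names and sample inputs are hypothetical.

```python
import math


def length_norm(num_tokens):
    """One of the two lengthNorm forms mentioned in the thread."""
    return 1.0 / math.sqrt(num_tokens)


def document_boost(page_score, score_power):
    # Assumption: boost = page_score ** indexer.score.power.
    return page_score ** score_power


def quantize_3bit(value):
    """Truncate a positive float to 1 implicit + 3 explicit mantissa bits.

    Assumption: this mimics a lossy one-byte norm encoding; exponent
    range limits are ignored for simplicity.
    """
    if value <= 0.0:
        return 0.0
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent
    kept = math.floor(mantissa * 16) / 16   # 0.5 <= mantissa < 1, keep 4 bits
    return math.ldexp(kept, exponent)


def field_norm(num_tokens, page_score, score_power):
    raw = length_norm(num_tokens) * document_boost(page_score, score_power)
    return quantize_3bit(raw)


# lengthNorm alone never exceeds 1, so a fieldNorm of 5.0 implies a boost.
print(length_norm(4))                # 0.5
print(quantize_3bit(5.3))            # 5.0
print(field_norm(4, 100.0, 0.5))     # 5.0
print(field_norm(4, 100.0, 0.25))    # 1.5
```

Between 4 and 8 the only values that survive this quantization are 4.0, 4.5, 5.0, and so on in steps of 0.5, which is consistent with the clean numbers reported. Lowering the power from 0.5 to 0.25 in this sketch drops the norm from 5.0 to 1.5.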
>
>As for your problem in #2, look at the explain page to see how your search
>result got there.  Maybe there's a high score for an anchor match.  The
>anchor text doesn't show up on the text of the page, so maybe that's it.
>
>Andy
>
>On 8/3/05, Howie Wang <howie_wang@hotmail.com> wrote:
> > Hi,
> >
> > I've been noticing some strange search results recently. I seem to be
> > getting two issues.
> >
> > 1. The fieldNorm for certain terms is unusually high for certain sites
> > for anchors and titles. And they are usually just whole numbers (4.0,
> > 5.0, etc).
> > I find this strange since the lengthNorm used to calculate this is
> > very unlikely to result in an integer. It's either 1/sqrt(numTokens)
> > or 1/log(e+numTokens). Where is 5.0 coming from?
> >
> > 2. I'm getting hits for sites that don't contain ANY of the terms in
> > my search. This is exacerbated by issue #1 since the fieldNorm boosts
> > this page to the top of the results. I thought it might be because of
> > my changes for stemming, but this happens for search terms that are
> > not changed by stemming at all.
> >
> > Anyone run into something like this? Any ideas on how to start
> > debugging?
> >
> > Thanks,
> > Howie