nutch-dev mailing list archives

From "Chirag Chaman" <...@filangy.com>
Subject RE: Strange search results
Date Wed, 03 Aug 2005 18:32:25 GMT
You should be able to do that by simply changing the boosts in the nutch
properties file.
Change your title boost to 3 or 4 and bring down all the other boosts to
something less than 1.

Re-indexing is not necessary. You only need to re-index if you want to
change the boost in the norm field (NOTE: this boost is DIFFERENT from the
query boost), which is encoded into the field at index time and multiplied
into the score. The query boost is then multiplied on top of that.
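
As a rough illustration of how the two boosts combine, here is a minimal
Python sketch. The function and parameter names are hypothetical, and the
formula is deliberately simplified (it leaves out tf, idf and the other
scoring factors). It only shows that the stored norm stays fixed once the
index is written, while the query boost can be changed at search time.

```python
def field_score(raw_match, index_norm, query_boost):
    """Simplified per-field score: raw match * stored norm * query-time boost.

    This is an illustrative model, not the actual Lucene/Nutch scoring formula.
    """
    return raw_match * index_norm * query_boost


# Assumed value: the norm was written into the index when the page was indexed,
# so changing it would require re-indexing.
stored_norm = 0.5

# Only the query boost differs between the two calls; no re-index is involved.
before = field_score(raw_match=2.0, index_norm=stored_norm, query_boost=1.0)
after = field_score(raw_match=2.0, index_norm=stored_norm, query_boost=4.0)
```

Raising the query boost from 1.0 to 4.0 scales the field's contribution by
the same factor, with the stored norm untouched.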

The only problem I see is that you don't want to match on content at all.
For that you will need to change the query so it does not look in that
field, or give that field a very low boost as well (any boost between 0 and
1 reduces the field's contribution instead of raising it). AFAIK, to drop
the content part entirely you will need to modify the query code.
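
To make the difference between a low boost and leaving a field out of the
query concrete, here is a small Python sketch. Everything in it is
hypothetical: the dict of boosts, the values, and the score function are
illustrative only and do not mirror the real Nutch query code. A boost of
0.0 stands in for "the query does not search this field".

```python
# Hypothetical per-field query boosts. The field names are the ones discussed
# in this thread; the values are made up for illustration.
QUERY_BOOSTS = {"title": 4.0, "anchor": 0.05, "content": 0.0}


def score(field_matches, boosts):
    """Sum boosted per-field match scores, skipping fields with a zero boost."""
    total = 0.0
    for field, match in field_matches.items():
        boost = boosts.get(field, 0.0)
        if boost > 0.0:
            total += match * boost
    return total


# A page that matches only in its title, one only in its body text,
# and one only through anchor text.
title_hit = {"title": 1.0, "anchor": 0.0, "content": 0.0}
content_hit = {"title": 0.0, "anchor": 0.0, "content": 1.0}
anchor_hit = {"title": 0.0, "anchor": 1.0, "content": 0.0}

title_score = score(title_hit, QUERY_BOOSTS)
content_score = score(content_hit, QUERY_BOOSTS)
anchor_score = score(anchor_hit, QUERY_BOOSTS)
```

With these assumed values the title-only page dominates, the anchor-only
page still scores slightly above zero, and the content-only page contributes
nothing, which is the behaviour a title-only search is after.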


 

-----Original Message-----
From: Fredrik Andersson [mailto:fidde.andersson@gmail.com] 
Sent: Wednesday, August 03, 2005 2:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Strange search results

While on the topic guys, if you require another weighting scheme than the
default one, will a re-indexing really be necessary? I'm currently trying to
search just some of the fields. For instance, I'd like to base the hits
entirely on the page title, not by anchor text, contents or other factors. I
thought this would be a matter of hacking the searcher-part of Nutch, not
the index, but I haven't figured it out yet. Any wise words on this problem?

Fredrik

On 8/3/05, Howie Wang <howie_wang@hotmail.com> wrote:
> Thanks for the tips, Andy and Chirag! It saves me a lot of trouble.
> I'll tweak the boosting for anchors and re-index and see where it gets 
> me.
> 
> Thanks,
> Howie
> 
> 
> >Concur with Andy on both points -- Unfortunately, there is no way to 
> >"go back" and remove either of these values without reindexing, so 
> >let me save you the trouble if you were thinking of changing the 
> >similarity class as a workaround.
> >
> >IMO, the problem with anchors is that you either need to get them 
> >all, or not get them at all -- getting just a few anchors can give 
> >you really bad results as stuff like "click here" will give pages a 
> >high score that don't contain either of these terms.  Another 
> >approach is to go in the properties file and change the boost of 
> >anchors to 0.05, thus giving them a very, very low boost.
> >
> >Regarding the norm -- this is done at index time for each field. 
> >We've changed the indexing code so that it's always 1
> >
> >HTH,
> >CC
> >
> >
> >-----Original Message-----
> >From: Andy Liu [mailto:andyliu1227@gmail.com]
> >Sent: Wednesday, August 03, 2005 8:00 AM
> >To: nutch-dev@lucene.apache.org
> >Subject: Re: Strange search results
> >
> >The fieldNorm is lengthNorm * document boost.  The final value is
> >"rounded", so that's why you're getting such clean numbers for your
> >fieldNorm.
> >If you're finding that these pages have too high of a boost, you can 
> >lower indexer.score.power in your conf file.
> >
> >As for your problem in #2, look at the explain page to see how your 
> >search result got there.  Maybe there's a high score for an anchor 
> >match.  The anchor text doesn't show up on the text of the page, so
> >maybe that's it.
> >
> >Andy
> >
> >On 8/3/05, Howie Wang <howie_wang@hotmail.com> wrote:
> > > Hi,
> > >
> > > I've been noticing some strange search results recently. I seem to 
> > > be getting two issues.
> > >
> > > 1. The fieldNorm for certain terms is unusually high for certain 
> > > sites for anchors and titles. And they are usually just whole 
> > > numbers (4.0, 5.0, etc).
> > > I find this strange since the lengthNorm used to calculate this is 
> > > very unlikely to result in an integer. It's either 
> > > 1/sqrt(numTokens) or 1/log(e+numTokens). Where is 5.0 coming from?
> > >
> > > 2. I'm getting hits for sites that don't contain ANY of the terms 
> > > in my search. This is exacerbated by issue #1 since the fieldNorm 
> > > boosts this page to the top of the results. I thought it might be 
> > > because of my changes for stemming, but this happens for search 
> > > terms that are not changed by stemming at all.
> > >
> > > Anyone run into something like this? Any ideas on how to start
> > > debugging?
> > >
> > > Thanks,
> > > Howie
> > >
> > >
> > >
> > >
> >
> >
> 
> 
>
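
The "rounded" fieldNorm values discussed in the quoted thread (4.0, 5.0 and
so on) can be sketched as follows. Lucene stores each norm in a single byte,
so only a few mantissa bits survive. The Python below is a simplified model
of that loss of precision, assuming 3 mantissa bits and ignoring the limited
exponent range; it is not the actual Lucene encoding routine, and the
function name is hypothetical.

```python
import math


def quantize_norm(value, mantissa_bits=3):
    """Truncate a positive float to `mantissa_bits` fractional mantissa bits.

    Simplified stand-in for a one-byte norm encoding; not Lucene's own code.
    """
    if value <= 0.0:
        return 0.0
    # frexp returns value = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(value)
    # One extra bit because frexp's mantissa starts at 0.5 rather than 1.0.
    steps = 2 ** (mantissa_bits + 1)
    return math.floor(mantissa * steps) / steps * 2.0 ** exponent


# Assumed inputs: a 4-token field with lengthNorm = 1/sqrt(numTokens),
# multiplied by a made-up document boost of 10.6.
length_norm = 1.0 / math.sqrt(4)
field_norm = quantize_norm(length_norm * 10.6)
```

Under these assumptions the unrounded product is 5.3, and after truncation
the stored value comes out as a clean 5.0, which is the kind of whole number
Howie was seeing.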


