nutch-dev mailing list archives

From Fredrik Andersson <fidde.anders...@gmail.com>
Subject Re: Strange search results
Date Wed, 03 Aug 2005 20:09:46 GMT
Thanks for that explanation Chirag, that was what I was looking for.

I use a pretty pimped up segment and index - I'm not using Nutch for
traditional webpages but for accessing other types of data. So I will
still want to index all of the fields, it's just that a particular
search should apply only to certain attributes of the data. I'll just
"deboost" the fields I'm currently not interested in, that will work
just fine : )
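
For illustration, here is a minimal sketch of the deboosting idea. It is a toy model, not Nutch or Lucene code: the `score` and `rank` functions, the field names, and all numbers are assumptions made up for this example, and real scoring involves more factors (tf, idf, coord, queryNorm). It only shows that the index-time norms stay fixed while the query-time boosts alone change the ranking, which is why no re-index is needed.

```python
# Toy model: contribution per field = match weight * index-time norm * query-time boost.
# All names and values below are hypothetical, chosen only to illustrate the idea.

def score(matches, field_norms, query_boosts):
    """Sum the per-field contributions for one document."""
    return sum(
        weight * field_norms.get(field, 1.0) * query_boosts.get(field, 1.0)
        for field, weight in matches.items()
    )

# Index-time data: fixed once the index has been built.
DOCS = {
    # Matches only through anchor text, with a large stored fieldNorm.
    "anchor_hit": {"matches": {"anchor": 1.0}, "norms": {"anchor": 5.0}},
    # Matches in the title, with an ordinary fieldNorm.
    "title_hit": {"matches": {"title": 1.0}, "norms": {"title": 1.0}},
}

# Query-time boosts: can be changed without touching the index.
default_boosts = {"title": 1.0, "anchor": 1.0, "content": 1.0}
title_only_boosts = {"title": 4.0, "anchor": 0.05, "content": 0.05}

def rank(boosts):
    """Return document names ordered by descending score."""
    return sorted(
        DOCS,
        key=lambda name: score(DOCS[name]["matches"], DOCS[name]["norms"], boosts),
        reverse=True,
    )

if __name__ == "__main__":
    print(rank(default_boosts))
    print(rank(title_only_boosts))
```

With the default boosts the anchor-only match ranks first because of its large norm; with the title raised to 4 and the other fields dropped to 0.05, the title match moves to the top.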

Big thanks,
Fredrik

On 8/3/05, Chirag Chaman <dev@filangy.com> wrote:
> You should be able to do that by simply changing the boosts in the nutch
> properties file.
> Change your title boost to 3 or 4 and bring down all the other boosts to
> something less than 1.
> 
> Re-indexing is not necessary. You only need to re-index if you want to
> change the boost in the norm field (NOTE: this boost is DIFFERENT from the
> query boost), which is encoded into the field and multiplied with the score --
> the query boost is then multiplied into this further.
> 
> The only problem I see is that you don't want to index anything by content
> -- for that you will need to change the query to not look in that field or
> give that a very low boost as well (anything between 0 and 1 is a negative
> boost). AFAIK, to change the content part you will need to modify the query
> code.
> 
> 
> 
> 
> -----Original Message-----
> From: Fredrik Andersson [mailto:fidde.andersson@gmail.com]
> Sent: Wednesday, August 03, 2005 2:23 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Strange search results
> 
> While on the topic guys, if you require another weighting scheme than the
> default one, will a re-indexing really be necessary? I'm currently trying to
> search just some of the fields. For instance, I'd like to base the hits
> entirely on the page title, not by anchor text, contents or other factors. I
> thought this would be a matter of hacking the searcher-part of Nutch, not
> the index, but I haven't figured it out yet. Any wise words on this problem?
> 
> Fredrik
> 
> On 8/3/05, Howie Wang <howie_wang@hotmail.com> wrote:
> > Thanks for the tips, Andy and Chirag! It saves me a lot of trouble.
> > I'll tweak the boosting for anchors and re-index and see where it gets
> > me.
> >
> > Thanks,
> > Howie
> >
> >
> > >Concur with Andy on both points -- Unfortunately, there is no way to
> > >"go back" and remove either of these values without reindexing, so
> > >let me save you the trouble if you were thinking of changing the
> > >similarity class as a workaround.
> > >
> > >IMO, the problem with anchors is that you either need to get them
> > >all, or not get them at all -- getting just a few anchors can give
> > >you really bad results as stuff like "click here" will give pages a
> > >high score that don't contain either of these terms.  Another
> > >approach is to go in the properties file and change the boost of
> > >anchors to 0.05, thus giving them a very very low boost
> > >
> > >Regarding the norm -- this is done at index time for each field.
> > >We've changed the indexing code so that it's always 1
> > >
> > >HTH,
> > >CC
> > >
> > >
> > >-----Original Message-----
> > >From: Andy Liu [mailto:andyliu1227@gmail.com]
> > >Sent: Wednesday, August 03, 2005 8:00 AM
> > >To: nutch-dev@lucene.apache.org
> > >Subject: Re: Strange search results
> > >
> > >The fieldNorm is lengthNorm * document boost.  The final value is
> > >"rounded" so that's why you're getting such clean numbers for your
> > >fieldNorm.
> > >If you're finding that these pages have too high of a boost, you can
> > >lower indexer.score.power in your conf file.
> > >
> > >As for your problem in #2, look at the explain page to see how your
> > >search result got there.  Maybe there's a high score for an anchor
> > >match.  The anchor text doesn't show up on the text of the page, so
> > >maybe that's it.
> > >
> > >Andy
> > >
> > >On 8/3/05, Howie Wang <howie_wang@hotmail.com> wrote:
> > > > Hi,
> > > >
> > > > I've been noticing some strange search results recently. I seem to
> > > > be getting two issues.
> > > >
> > > > 1. The fieldNorm for certain terms is unusually high for certain
> > > > sites for anchors and titles. And they are usually just whole
> > > > numbers (4.0, 5.0, etc).
> > > > I find this strange since the lengthNorm used to calculate this is
> > > > very unlikely to result in an integer. It's either
> > > > 1/sqrt(numTokens) or 1/log(e+numTokens). Where is 5.0 coming from?
> > > >
> > > > 2. I'm getting hits for sites that don't contain ANY of the terms
> > > > in my search. This is exacerbated by issue #1 since the fieldNorm
> > > > boosts this page to the top of the results. I thought it might be
> > > > because of my changes for stemming, but this happens for search
> > > > terms that are not changed by stemming at all.
> > > >
> > > > > Anyone run into something like this? Any ideas on how to start
> > > > > debugging?
> > > >
> > > > Thanks,
> > > > Howie
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> >
> 
> 
>
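
A note on the whole-number fieldNorm values asked about at the start of the thread: the sketch below shows how storing lengthNorm * document boost in a single byte can produce values such as 4.0 and 5.0. It is modelled on a 5-bit exponent, 3-bit mantissa encoding of the kind Lucene uses for norms; the exact bit layout, the function names, and the sample token counts and boosts are assumptions for illustration and may not match a given Lucene or Nutch version.

```python
import math
import struct

def encode_norm(value):
    """Pack a non-negative float into one byte (5-bit exponent, 3-bit mantissa).

    Lossy: the low mantissa bits are simply dropped.
    """
    if value <= 0.0:
        return 0
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    mantissa = (bits & 0xFFFFFF) >> 21
    exponent = ((bits >> 24) & 0x7F) - 63 + 15
    if exponent > 31:          # too large: clamp to the biggest code
        exponent, mantissa = 31, 7
    if exponent < 0:           # too small: clamp to the smallest non-zero code
        exponent, mantissa = 0, 1
    return (exponent << 3) | mantissa

def decode_norm(code):
    """Expand a byte produced by encode_norm back into a float."""
    if code == 0:
        return 0.0
    mantissa = code & 7
    exponent = (code >> 3) & 31
    bits = ((exponent + 48) << 24) | (mantissa << 21)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def stored_field_norm(num_tokens, doc_boost):
    """fieldNorm as read back from the index: lengthNorm * boost, after the byte round trip."""
    length_norm = 1.0 / math.sqrt(num_tokens)
    return decode_norm(encode_norm(length_norm * doc_boost))

if __name__ == "__main__":
    for tokens, boost in [(3, 9.0), (5, 10.0)]:
        raw = boost / math.sqrt(tokens)
        print(tokens, boost, round(raw, 3), stored_field_norm(tokens, boost))
```

Under these assumptions, a 3-token field with a document boost of 9.0 has a raw value of about 5.196 but reads back as 5.0, and a 5-token field with a boost of 10.0 (about 4.472) reads back as 4.0. Between 4 and 8 only 4, 5, 6 and 7 are representable, so a large document boost naturally yields whole numbers.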
