nutch-dev mailing list archives

From Piotr Kosiorowski <pkosiorow...@gmail.com>
Subject Re: Strange search results
Date Fri, 05 Aug 2005 13:26:58 GMT
Hello,
In my experience it is very important to use anchor text and to give it
a fairly high boost. It lets me return http://www.aa.com when a user
searches for "American Airlines". Without anchor text this was
impossible: many sites (spam or not) with "american airlines" in the
URL and title were returned first.

So in my opinion, for result quality it is important to use anchor
text and also to apply some techniques for identifying spam sites, so
that the effect of anchor-text spamming is greatly reduced. Generic
anchor texts like "Click here!", "here", "Click to Open in a new
window" etc. are quite easy to remove during indexing.
We spent some time cleaning unwanted pages out of our index and I
think it was time well spent.
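
A minimal sketch of dropping generic anchor texts before they reach the
index. The class name, method names, and stop list are hypothetical and
are not part of Nutch; a real stop list would more likely be built from
the most frequent anchors seen in the crawl.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class GenericAnchorFilter {
    // Hypothetical stop list of anchors that say nothing about the target page.
    private static final Set<String> GENERIC = new HashSet<>(Arrays.asList(
        "click here", "here", "click to open in a new window"));

    // Lowercase, replace anything that is not an ASCII letter, digit or space,
    // then collapse whitespace. ASCII-only on purpose to keep the sketch short.
    public static String normalize(String anchor) {
        return anchor.toLowerCase(Locale.ROOT)
                     .replaceAll("[^a-z0-9 ]", " ")
                     .trim()
                     .replaceAll("\\s+", " ");
    }

    // True when the normalized anchor matches an entry in the stop list.
    public static boolean isGeneric(String anchor) {
        return GENERIC.contains(normalize(anchor));
    }

    // Returns only the anchors that are not on the stop list.
    public static List<String> keepInformative(List<String> anchors) {
        List<String> kept = new ArrayList<>();
        for (String a : anchors) {
            if (!isGeneric(a)) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> anchors =
            Arrays.asList("Click here!", "American Airlines", "here");
        System.out.println(keepInformative(anchors)); // prints [American Airlines]
    }
}
```

The filter would run on the inlink anchors collected for a page, before
the anchor field is added to the document.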

Regards
Piotr



On 8/3/05, Chirag Chaman <dev@filangy.com> wrote:
> Howie,
> 
> Concur with Andy on both points -- Unfortunately, there is no way to "go
> back" and remove either of these values without reindexing, so let me save
> you the trouble if you were thinking of changing the similarity class as a
> workaround.
> 
> IMO, the problem with anchors is that you either need to get them all, or
> not get them at all -- getting just a few anchors can give you really bad
> results, as stuff like "click here" will give a high score to pages that
> don't contain either of these terms.  Another approach is to go into the
> properties file and change the boost of anchors to 0.05, thus giving them a
> very low boost.
> 
> Regarding the norm -- this is done at index time for each field. We've
> changed the indexing code so that it's always 1.
> 
> HTH,
> CC
> 
> 
> -----Original Message-----
> From: Andy Liu [mailto:andyliu1227@gmail.com]
> Sent: Wednesday, August 03, 2005 8:00 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Strange search results
> 
> The fieldNorm is lengthNorm * document boost.  The final value is "rounded",
> so that's why you're getting such clean numbers for your fieldNorm.  If
> you're finding that these pages have too high a boost, you can lower
> indexer.score.power in your conf file.
> 
> As for your problem in #2, look at the explain page to see how your search
> result got there.  Maybe there's a high score for an anchor match.  The
> anchor text doesn't show up on the text of the page, so maybe that's it.
> 
> Andy
> 
> On 8/3/05, Howie Wang <howie_wang@hotmail.com> wrote:
> > Hi,
> >
> > I've been noticing some strange search results recently. I seem to be
> > getting two issues.
> >
> > 1. The fieldNorm for certain terms is unusually high for certain sites
> > for anchors and titles. And they are usually just whole numbers (4.0,
> > 5.0, etc).
> > I find this strange since the lengthNorm used to calculate this is
> > very unlikely to result in an integer. It's either 1/sqrt(numTokens)
> > or 1/log(e+numTokens). Where is 5.0 coming from?
> >
> > 2. I'm getting hits for sites that don't contain ANY of the terms in
> > my search. This is exacerbated by issue #1 since the fieldNorm boosts
> > this page to the top of the results. I thought it might be because of
> > my changes for stemming, but this happens for search terms that are
> > not changed by stemming at all.
> >
> > Anyone run into something like this? Any ideas on how to start debugging?
> >
> > Thanks,
> > Howie
> 
> 
>
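
To make the fieldNorm arithmetic from the quoted thread concrete, here
is a small sketch of fieldNorm = lengthNorm * document boost, using the
two lengthNorm formulas Howie mentions. The class name, method names,
and the boost of 10.0 are assumptions chosen for illustration. The
sketch does not reproduce the rounding Andy describes: Lucene stores
each norm in a single byte, which snaps nearby values to a small set of
round numbers.

```java
public class FieldNormSketch {
    // lengthNorm variant 1 from the thread: 1/sqrt(numTokens)
    public static double sqrtLengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // lengthNorm variant 2 from the thread: 1/log(e + numTokens)
    public static double logLengthNorm(int numTokens) {
        return 1.0 / Math.log(Math.E + numTokens);
    }

    // fieldNorm before it is stored: lengthNorm times the document boost
    public static double fieldNorm(double lengthNorm, double boost) {
        return lengthNorm * boost;
    }

    public static void main(String[] args) {
        double boost = 10.0; // assumed document boost, not a value from the thread
        // A 4-token field: lengthNorm = 0.5, so fieldNorm = 0.5 * 10.0
        System.out.println(fieldNorm(sqrtLengthNorm(4), boost)); // prints 5.0
        // The log variant gives a non-integer before any rounding is applied
        System.out.println(fieldNorm(logLengthNorm(4), boost));
    }
}
```

So a whole-number fieldNorm is not surprising once a document boost
well above 1 is multiplied in and the result is rounded for storage.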
