nutch-dev mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject Re: [Nutch-dev] Adding title and site to scoring
Date Tue, 22 Mar 2005 18:12:55 GMT
To manipulate ranking, you can use boosting.
You can boost documents in an index filter extension that you implement 
as a plugin.
The problem is that you cannot also change the boost field that is 
stored (unindexed) in the index.
So this may cause trouble for dedup and for explaining the ranking.
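
A minimal sketch of that mismatch, in plain Java without Lucene or Nutch on 
the classpath. SketchDoc, its "boost" entry and TitleBoostFilter are 
hypothetical stand-ins, not the Nutch API, and the sketch assumes the stored 
boost is written once before any filter runs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index document; NOT the Lucene/Nutch API.
class SketchDoc {
    final Map<String, String> stored = new HashMap<>(); // stored, unindexed fields
    float boost;                                         // boost used for scoring

    SketchDoc(float initialBoost) {
        this.boost = initialBoost;
        // Assumption: the boost is recorded once, before any filter runs.
        stored.put("boost", Float.toString(initialBoost));
    }
}

class TitleBoostFilter {
    // Multiplies the document boost; the stored copy is left untouched,
    // so the two values no longer agree afterwards.
    static SketchDoc filter(SketchDoc doc, float factor) {
        doc.boost *= factor;
        return doc;
    }

    public static void main(String[] args) {
        SketchDoc doc = filter(new SketchDoc(1.5f), 2.0f);
        System.out.println("effective boost: " + doc.boost);
        System.out.println("stored boost:    " + doc.stored.get("boost"));
    }
}
```

Anything that later reads the stored value (dedup or score explanation, in 
the terms used above) would see the old number rather than the effective one.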


On 20.03.2005 at 20:38, Michael Nebel wrote:

> Hi Piotr,
>
> as I wrote a month ago, I started working on the problem (was it 
> really so long ago? :-( ). But then real life caught up with me, and 
> when I checked the Nutch code again, many parts had changed. But the 
> plugin I started/copied should still work... perhaps I should give it 
> another try...
>
> Adding the field to the index is quite trivial once you take a 
> short look at Otis and Erik's book "Lucene in Action" (thanks!) or know 
> about Lucene. My next step was to extend the ranking, when I found the 
> mistake in my strategy of adding the url-length to the ranking: a string 
> is not an integer, and the current ranking works with strings. I think my 
> last post caused many smiles :-) For my part, I decided to read the rest 
> of the book before doing more coding.
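
One way to read the string-versus-integer problem: a url-length indexed as a 
plain string term compares lexicographically, not numerically. The sketch 
below shows the effect and one common workaround (zero-padding to a fixed 
width). The class UrlLengthField and the width of 5 are assumptions for 
illustration, not something from the thread.

```java
import java.util.Arrays;

class UrlLengthField {
    // Left-pads a non-negative length with zeros so that lexicographic
    // order of the strings matches numeric order of the values.
    static String pad(int length, int width) {
        String s = Integer.toString(length);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        String[] raw = {"100", "9", "25"};
        Arrays.sort(raw); // lexicographic: "100" sorts before "25" and "9"
        System.out.println(Arrays.toString(raw));

        String[] padded = {pad(100, 5), pad(9, 5), pad(25, 5)};
        Arrays.sort(padded); // lexicographic order now equals numeric order
        System.out.println(Arrays.toString(padded));
    }
}
```

Padding only works if the chosen width is at least as large as the longest 
value that will ever be indexed.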
>
> Looking at my list, I think I might return to this problem in the 
> middle or at the end of the week. Perhaps we could work together?
>
> Regards
>
> 	Michael
>
>
>
> Piotr Kosiorowski wrote:
>
>> Hello,
>> I would like to have title and host as separate indexed fields in our 
>> installation. As this topic was already discussed on the list over a 
>> month ago, I want to make sure that nothing has been implemented yet 
>> before I start coding myself. I am working for Sabre Holdings and we 
>> are implementing a travel search engine based on Nutch. The project is 
>> not at a very advanced stage right now, but I am spending the majority 
>> of my time working with Nutch and I would like to say I am really 
>> impressed with it. As I was looking through our index I saw a lot of 
>> examples where adding host as a separate indexed field would help a 
>> lot with relevancy. The title is much more difficult to judge: as 
>> Doug wrote, titles can be spammed quite easily, but it would be nice to 
>> have a separate parameter to control title matching. I was also 
>> thinking about adding special handling for "host" fields, as many 
>> companies concatenate parts of their names in the domain name (e.g. 
>> http://www.hewlettpackard.com, http://www.arthurandersen.com/ and 
>> even http://www.sabreholdings.com/ :) ), but I will see how it 
>> works during implementation. What do others think about such a 
>> feature? Does it make sense? I think the majority of concatenations 
>> would simply find no matching tokens in the host field, so it should 
>> not affect search performance heavily.
>> My plan is to start working on it this week; I will submit a patch 
>> when I finish. So is anyone working on it right now, or does anyone 
>> have something ready?
>> Or are there any special things I should consider?
>> Regards
>> Piotr Kosiorowski
>> Michael Nebel wrote:
>>> Hi,
>>>
>>> I'm afraid I'll have to deal with the ranking over the next few days / 
>>> the weekend. So perhaps I can contribute some time and work for all of 
>>> us.
>>>
>>> Before I head down the wrong path, some questions in advance:
>>>
>>> - Using Luke to look at my indexes, I see a field called <site>.
>>> - Some more checking: there is a query-site plugin.
>>> -> So the "host" part mentioned by Doug below should be available 
>>> right now.
>>>
>>> To take up the note from Wolfgang (boosting short urls), I want to 
>>> add another plugin that calculates the url-length and stores it in a 
>>> separate field. Perhaps it makes sense to write a third plugin 
>>> storing only the "path" of the url, so we can use the site, the path 
>>> and the total length for the ranking. The title might be a candidate 
>>> for a fourth plugin.
>>>
>>> My next step would be to extend the query-basic-plugin in two ways:
>>>
>>> 1.) read the weights out of the NutchConf
>>> 2.) read the used fields out of the NutchConf
>>>
>>> As a result, it should be possible to customize the ranking by 
>>> selecting the plugins and editing the config.
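
A rough sketch of the two steps above (fields and weights taken from 
configuration). It uses java.util.Properties as a stand-in for NutchConf, 
and the property names "query.fields" and "query.<field>.boost" are made up 
for illustration; they are not existing Nutch settings.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

class FieldWeights {
    // Reads "query.fields" (comma-separated) and one "query.<field>.boost"
    // entry per field. Both property names are hypothetical.
    static Map<String, Float> load(Properties conf) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (String field : conf.getProperty("query.fields", "").split(",")) {
            String name = field.trim();
            if (name.isEmpty()) {
                continue; // no fields configured, or a stray comma
            }
            // Fields without an explicit boost default to 1.0.
            String boost = conf.getProperty("query." + name + ".boost", "1.0");
            weights.put(name, Float.parseFloat(boost));
        }
        return weights;
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.load(new StringReader(
            "query.fields=url,anchor,content,title\n"
            + "query.url.boost=4.0\n"
            + "query.anchor.boost=2.0\n"));
        System.out.println(load(conf));
    }
}
```

With this shape, adding a field to the ranking would mean enabling the 
plugin that indexes it and listing the field in the config, without touching 
the query code.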
>>>
>>> Is this approach reasonable, or am I oversimplifying?
>>>
>>> Michael
>>>
>>>
>>>
>>> Doug Cutting wrote:
>>>
>>>> Andrzej Bialecki wrote:
>>>>
>>>>> Doug Cutting wrote:
>>>>>
>>>>>>
>>>>>> NutchSimilarity.lengthNorm() penalizes short content by 
>>>>>> normalizing all documents with fewer than 1000 content tokens 
>>>>>> as though they had 1000 content tokens.  Is this 
>>>>>> not sufficient?
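
The floor described above can be sketched as follows. This is not the Nutch 
source: it assumes the 1/sqrt(numTokens) length norm of Lucene's default 
similarity, and LengthNormSketch and contentLengthNorm are illustrative 
names. Only the 1000-token floor comes from the message.

```java
class LengthNormSketch {
    static final int MIN_CONTENT_LENGTH = 1000; // floor quoted in the message

    // Assumes a 1/sqrt(numTokens) norm. Documents shorter than the floor
    // are treated as if they had exactly MIN_CONTENT_LENGTH tokens, so a
    // very short page cannot earn a larger norm than a 1000-token page.
    static float contentLengthNorm(int numTokens) {
        int effective = Math.max(numTokens, MIN_CONTENT_LENGTH);
        return (float) (1.0 / Math.sqrt(effective));
    }

    public static void main(String[] args) {
        System.out.println(contentLengthNorm(10));   // same as for 1000 tokens
        System.out.println(contentLengthNorm(1000));
        System.out.println(contentLengthNorm(4000)); // half the 1000-token norm
    }
}
```

Note that the floor only caps the advantage of short content; it does not 
push short documents below longer ones.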
>>>>>
>>>>> Not in my experience. Please consider the following hits (attached 
>>>>> in a file), ordered by score, which I got from a 5-million-page 
>>>>> index of mostly Swedish sites for the query "apoteket" ("the 
>>>>> pharmacy" in Swedish). There is clearly something very wrong with 
>>>>> the second hit.
>>>>
>>>> Yes.  If that were a "title" match (which it really is), and titles 
>>>> were boosted less than anchors, then this would probably be third 
>>>> or lower.
>>>>
>>>>>> I don't object to indexing titles in a separate field.  They can 
>>>>>> be high quality, but they can also be spammed more easily than 
>>>>>> anchors.  In any case, separately controlling their boost, length 
>>>>>> normalization, etc. is probably a good idea.
>>>>>
>>>>> Ok, I'll prepare a patch for review.
>>>>
>>>> Great!  I'm glad more folks are looking at search result quality.  
>>>> This is very important, and not simple.
>>>>
>>>>> Example: all other things being equal (i.e. the content and 
>>>>> anchors), which url seems to be more representative for the query 
>>>>> "ikea":
>>>>>
>>>>> http://www.ikea.se/something/else.html
>>>>> http://www.something.se/else/ikea.html
>>>>>
>>>>> IMHO the first url should be given a higher score. Currently they 
>>>>> get the same score.
>>>>
>>>> Agreed.  This argues for "host" as a separate indexed field.
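
To make the "ikea" example concrete, here is a small sketch of what a 
separate host field would let the ranking distinguish. HostField, hostTokens 
and hostMatches are illustrative names, and splitting the host on '.' and 
'-' is an assumption about tokenization, not how Nutch analyzes the field.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class HostField {
    // Splits the host part of a URL into lowercase tokens on '.' and '-'.
    static List<String> hostTokens(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Collections.emptyList();
        }
        return Arrays.asList(host.toLowerCase(Locale.ROOT).split("[.-]"));
    }

    // True only when the term is a token of the host, not of the path.
    static boolean hostMatches(String url, String term) {
        return hostTokens(url).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hostMatches("http://www.ikea.se/something/else.html", "ikea"));
        System.out.println(hostMatches("http://www.something.se/else/ikea.html", "ikea"));
    }
}
```

The first URL matches on the host and the second only in the path, so a 
boost on the host field would separate two hits that otherwise score the 
same.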
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net

