lucene-solr-user mailing list archives

From scott chu (朱炎詹) <scott....@udngroup.com>
Subject Re: Why it's boosted up?
Date Wed, 25 Aug 2010 01:07:10 GMT
Thanks for your clear explanation! I got it :)
----- Original Message ----- 
From: "MitchK" <mitch91@web.de>
To: <solr-user@lucene.apache.org>
Sent: Tuesday, August 24, 2010 3:37 PM
Subject: Re: Why it's boosted up?


>
> Hi Scott,
>
>
>
>> (so shorter fields are automatically boosted up)."
>>
> The theory behind that is the following (in simple terms):
> Let's say you have two documents, and each doc contains only one field (as
> in my example).
> Additionally, we have a query that contains two words.
> Let's say doc1 contains 10 words and doc2 contains 20 words.
> The query matches both docs with both words.
> The idea of boosting shorter fields more strongly than longer fields is the
> following:
> In doc1, 2/10 = 0.2 => 20% of the words match your query.
> In doc2, 2/20 = 0.1 => 10% of the words match your query.
>
> So doc1 should get a better score, because the ratio of matching words to
> the total number of occurring words is greater than in doc2.
> This is the idea of using norms as an index-time boosting factor. NOTE:
> This does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only
> illustrates the idea behind such norms.
>
> From the Similarity class's documentation of lengthNorm():
>
>
>
>> Matches in longer fields are less precise, so implementations of this
>> method usually return smaller values when numTokens is large, and larger
>> values when numTokens is small.
>>
>
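The ratio example and the lengthNorm() description quoted above can be sketched in a few lines of Python. This is only an illustration, not Lucene code: the function names are made up for this sketch, and the 1/sqrt(numTokens) formula is what Lucene's DefaultSimilarity used at the time, as far as I recall. Other Similarity implementations may use a different formula.

```python
import math


def match_ratio(matching_terms: int, field_length: int) -> float:
    """Share of the field's words that match the query (Mitch's 2/10 vs 2/20)."""
    return matching_terms / field_length


def length_norm(num_tokens: int) -> float:
    """Assumed formula: 1/sqrt(numTokens), so shorter fields get larger values."""
    return 1.0 / math.sqrt(num_tokens)


if __name__ == "__main__":
    # doc1 has 10 words, doc2 has 20 words; both match the two query words.
    for name, length in (("doc1", 10), ("doc2", 20)):
        ratio = match_ratio(2, length)
        norm = length_norm(length)
        print(f"{name}: ratio={ratio:.2f} lengthNorm={norm:.3f}")
```

Note that the norm depends only on the field length, not on how many words match; the ratio is just the intuition behind it.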
> However, as a search application developer, it is your task to decide
> whether this theory applies to your application or not. In some cases
> using norms makes no sense, in others it does.
> If you think that norms apply to your project, omitting them is not a good
> way to save disk space.
> Furthermore: if you think the theory does apply to the business needs of
> your application but its impact is currently too heavy, you can have a look
> at SweetSpotSimilarity in Lucene.
>
>
>
>> The request is from our business team; they want users of our product to
>> be able to type in a partial string of a word that exists in the title or
>> body field.
>>
> You mean something like typing "note" and also getting results like
> "notebook"?
> The correct approach for something like that is not the ShingleFilter but
> NGrams or edge NGrams.
> Shingles do something like this:
> "This is my shingle sentence" -> "This is, is my, my shingle, shingle
> sentence" -> it breaks the sentence up into smaller pieces. The benefit of
> doing so is that, if a query matches one of these shingles, you have found
> a short phrase without using the performance-hungry PhraseQuery.
>
> Kind regards,
> - Mitch
>
>
> scott chu wrote:
>>
>> In Lucene's web page, there's a paragraph:
>>
>> "Indexing time boosts are preprocessed for storage efficiency and written
>> to the directory (when writing the document) in a single byte (!) as
>> follows: For each field of a document, all boosts of that field (i.e. all
>> boosts under the same field name in that doc) are multiplied. The result is
>> multiplied by the boost of the document, and also multiplied by a "field
>> length norm" value that represents the length of that field in that doc
>> (so shorter fields are automatically boosted up)."
>>
>> I thought that the greater the value, the stronger the boost. Then why are
>> short fields boosted up? Isn't the norm value for short fields smaller?
>>
>>
>>
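The multiplication described in the quoted paragraph can be written out as a small Python sketch. The function name is hypothetical, the length norm is assumed to be 1/sqrt(numTokens), and the lossy single-byte encoding mentioned in the quote is left out, so the numbers are only illustrative.

```python
import math


def field_norm(doc_boost: float, field_boosts: list[float], num_tokens: int) -> float:
    """Document boost * product of all field boosts * length norm.

    Assumption: length norm = 1/sqrt(num_tokens); the one-byte
    compression Lucene applies when storing the value is not modeled.
    """
    length_norm = 1.0 / math.sqrt(num_tokens)
    return doc_boost * math.prod(field_boosts) * length_norm


if __name__ == "__main__":
    # Same boosts, different field lengths: the shorter field wins.
    print(field_norm(1.0, [1.0], 10))
    print(field_norm(1.0, [1.0], 20))
```

Under this assumed formula the norm value for a short field is larger, not smaller, which is why shorter fields end up boosted.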
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

