lucene-solr-user mailing list archives

From Mihran Shahinian <slowmih...@gmail.com>
Subject Re: Relevancy : Keyword stuffing
Date Mon, 16 Mar 2015 21:40:37 GMT
Thank you, Markus and Chris, for the pointers.

For SweetSpotSimilarity, I am thinking that a set of closed ranges exposed
via the similarity config would be easier to maintain as the data changes
than adjusting parameters to fit a function.

Another piece of information that would have been handy is the average
position of each term, plus the positions of its first few occurrences.
That would perhaps allow higher boosting for term occurrences earlier in
the doc. In my case the extra keywords are towards the end of the doc, but
that information does not seem to be propagated into the scorer.
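
To make the two ideas above concrete, here is a minimal sketch in Python
(chosen only for brevity). Nothing in it is Lucene or Solr API: the names
range_norm and position_weight, the range boundaries, the factors, and the
linear decay are all hypothetical illustration values. It only shows what a
table of closed length ranges and a position-aware weight might look like.

```python
from bisect import bisect_right

# Hypothetical closed ranges of field length (in tokens) -> norm factor.
# Boundaries and factors are made-up values; ranges must be sorted by
# lower bound and must not overlap.
LENGTH_RANGES = [
    (1, 50, 1.0),
    (51, 200, 0.8),
    (201, 1000, 0.5),
]
DEFAULT_FACTOR = 0.25  # applied to lengths that fall outside every range


def range_norm(length, ranges=LENGTH_RANGES, default=DEFAULT_FACTOR):
    """Return the factor of the closed range [lo, hi] that contains length."""
    starts = [lo for lo, _, _ in ranges]
    i = bisect_right(starts, length) - 1  # last range starting at or before length
    if i >= 0:
        lo, hi, factor = ranges[i]
        if lo <= length <= hi:
            return factor
    return default


def position_weight(positions, doc_length, floor=0.2):
    """Average a linear decay over term positions.

    An occurrence at position 0 weighs 1.0; the weight falls linearly
    towards `floor` as the position approaches doc_length.
    """
    if not positions or doc_length <= 0:
        return 0.0
    weights = [1.0 - (1.0 - floor) * (p / doc_length) for p in positions]
    return sum(weights) / len(weights)
```

With a table like this, changing the behaviour for a length band means
editing one row rather than re-tuning a curve, and a term that only shows
up near the end of the doc gets a lower position_weight than one that
appears near the start.
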
Thanks again,
Mihran



On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter <hossman_lucene@fucit.org>
wrote:

>
> You should start by checking out the "SweetSpotSimilarity" .. it was
> heavily designed around the idea of dealing with things like excessively
> verbose titles, and keyword stuffing in summary text ... so you can
> configure your expectation for what a "normal" length doc is, and docs
> will be penalized for being longer than that.  similarly you can say what
> a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
> (which in conjunction with the lengthNorm penalty penalizes docs that
> stuff keywords)
>
>
> https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
>
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
>
>
> -Hoss
> http://www.lucidworks.com/
>
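
The two curves linked in the quoted reply can be reproduced numerically
with a short sketch. The formulas below follow the ones given in the Lucene
SweetSpotSimilarity javadoc as best I recall them, so please check them
against the version you run; the function names and parameter values are
illustrative assumptions, not recommended settings.

```python
import math


def compute_length_norm(num_terms, min_len=1, max_len=1000, steepness=0.5):
    """Length norm with a plateau of 1.0 for min_len <= num_terms <= max_len.

    Outside the plateau the norm decays as 1/sqrt(steepness * distance + 1).
    """
    distance = (abs(num_terms - min_len) + abs(num_terms - max_len)
                - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def hyperbolic_tf(freq, tf_min=0.0, tf_max=2.0, base=1.3, xoffset=10.0):
    """S-shaped tf factor that levels off near tf_max.

    Repetitions well beyond xoffset add almost nothing to the score.
    """
    up = base ** (freq - xoffset)
    down = base ** -(freq - xoffset)
    return tf_min + (tf_max - tf_min) / 2.0 * ((up - down) / (up + down) + 1.0)
```

Under these assumed values, any doc between 1 and 1000 terms gets the same
length norm, longer docs are penalized, and going from 50 to 100
occurrences of a term changes hyperbolic_tf by a negligible amount, which
is the "no added boost" effect described in the reply.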
