lucene-dev mailing list archives

From Doron Cohen <>
Subject Re: search quality - assessment & improvements
Date Mon, 16 Jul 2007 23:41:00 GMT
Chris Hostetter wrote:

> i guess i'm not following how exactly your pivoted norm calculation works
> ... it sounds like you are still rewarding 1 term long fields more than
> any other length ... is the distinction between your approach and the
> default implementation just that the default is a smooth curve, while
> yours is two different curves -- one below the pivot (average length)
> and one above it? ... which functions do you use?

Basically it is
  (1 - Slope) * Pivot + (Slope) * Doclen
where Pivot reflects the average doc length, and a smaller
Slope reduces the amount by which short docs are preferred
over long ones. In a collection with very long documents, a
doc shorter than the pivot would be rewarded, but that same
doc would be rewarded relatively less in a collection with
shorter docs. So how much you reward adapts to the specific
collection's characteristics, without knowing those
characteristics in advance.
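To make that adaptation concrete, here is a small sketch in plain Java. The class and method names and the slope/pivot values are illustrative, not Lucene's API, and it assumes the norm is the reciprocal of the quantity above (the standard pivoted-normalization form, for which the formula above is the denominator):

```java
// Sketch of pivoted document-length normalization as described above.
// Names and parameter values are illustrative, not Lucene's API.
public class PivotedNorm {

    /**
     * Score multiplier for a document of length doclen, given the
     * collection's average length (pivot) and a slope in (0, 1].
     * A smaller slope reduces the preference for short documents.
     */
    public static double lengthNorm(double doclen, double pivot, double slope) {
        return 1.0 / ((1.0 - slope) * pivot + slope * doclen);
    }

    public static void main(String[] args) {
        double slope = 0.25;   // illustrative value
        double doclen = 10.0;  // a short document

        // Reward relative to an average-length doc in each collection:
        double longColl  = lengthNorm(doclen, 100.0, slope) / lengthNorm(100.0, 100.0, slope);
        double shortColl = lengthNorm(doclen,  20.0, slope) / lengthNorm( 20.0,  20.0, slope);

        // The same short doc is rewarded more when the pivot (average
        // length) is large, and relatively less when the pivot is small.
        System.out.printf("pivot=100: %.3f, pivot=20: %.3f%n", longColl, shortColl);
    }
}
```

Running this prints a larger relative reward for the short doc under pivot=100 than under pivot=20, which is the collection-adaptive behavior described above.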

> : question is how to compute/store/retrieve this data.
> : The way I experimented with it was not focused on efficiency
> : but rather on flexibility at search time, my custom analyzer
> : counted the number of unique tokens in the document, and finally
> : a field was added to the document with this number. At search
> : time this field was loaded (for all docs), the average was
> One option to avoid that extra work at index building time would be to
> use logic like what's in LengthNormModifier to build a cache when the
> IndexReader is opened containing the number of terms (either unique or
> total depending on whether you use +=freq or ++) in each doc per field.
> it's really no different than a FieldCache -- except that the
> FieldCache.getCustom API doesn't really give you the means to compute
> arbitrary values, but the principle is the same.
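The counting loop Chris describes can be sketched in plain Java. The postings map here stands in for Lucene's TermEnum/TermDocs iteration (as walked by LengthNormModifier); the class and method names are made up for illustration, and only the +=freq vs ++ choice is the point:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of building a per-field length cache when the "reader"
// opens: walk the field's postings once and count terms per document.
// The Map stands in for Lucene's TermEnum/TermDocs; it is not the API.
public class FieldLengthCache {

    /**
     * postings: term -> list of [docId, freq] pairs for one field.
     * Returns per-doc term counts: counts[doc]++ for unique terms,
     * or counts[doc] += freq for total terms, depending on 'unique'.
     */
    public static int[] countTerms(Map<String, List<int[]>> postings,
                                   int maxDoc, boolean unique) {
        int[] counts = new int[maxDoc];
        for (List<int[]> docs : postings.values()) {
            for (int[] entry : docs) {
                int doc = entry[0], freq = entry[1];
                counts[doc] += unique ? 1 : freq;  // the ++ vs +=freq choice
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> body = new HashMap<>();
        body.put("pivot", List.of(new int[]{0, 3}, new int[]{1, 1}));
        body.put("norm",  List.of(new int[]{0, 2}));

        int[] uniq  = countTerms(body, 2, true);   // doc 0 has 2 unique terms
        int[] total = countTerms(body, 2, false);  // doc 0 has 5 total terms
        System.out.println(uniq[0] + " " + total[0]);
    }
}
```

Built once per IndexReader, such an array plays the same role as a FieldCache entry, which is Chris's point above.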

I think neither is good enough for large dynamic collections.
Both are fine for experiments, but a large, dynamic working
system would need something more efficient.

> : natural way to do this is to have two fields "body" and
> : "title", set their boosts 1 for "body" and 3 for "title",
> : and then, when one searches the entire document (without
> : specifying a field), create a multi field query. Things should
> : work fine -- boosts are ok, tf() is by field, so is norm.
> : But empirically it doesn't work well. When I modified
> were the boosts you are referring to index time boosts or query time
> boosts?  if they were index time (and you applied them to every document
> since in theory the title of every document is worth 3 times as much as
> the body of that document) then i think your index time boosts wound
> up being a complete wash.

No, they were query time boosts.
