lucene-dev mailing list archives

From Chris Hostetter
Subject Re: search quality - assessment & improvements
Date Mon, 09 Jul 2007 02:28:25 GMT

: Thanks for your comments Chris, and sorry for the delayed

my turn for a delayed response ... i figured there was no rush since you
were offline for 10 days :)

: I didn't try this - passing the computed avg doc length to
: SweetSpotSimilarity (SSS) - it would be interesting to try. I wonder
: how this would perform comparing to the variation of pivoted
: (unique) length normalization that I tried. The difference is
: that SSS punishes docs above and below the range, while with
: pivoted normalization docs above the pivot are punished and
: those below the pivot boosted. Pivoted normalization makes
: more sense to me than SSS.

i guess i'm not following how exactly your pivoted norm calculation works
... it sounds like you are still rewarding 1 term long fields more than
any other length ... is the distinction between your approach and the
default implementation just that the default is a smooth curve, while yours
is two different curves -- one below the pivot (average length) and one
above it? ... which functions do you use?
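
For reference, here is a rough plain-Java sketch of the three curve shapes being compared. It does not call Lucene. The default-style curve is 1/sqrt(numTerms), the plateau curve follows the formula documented for SweetSpotSimilarity, and the pivoted curve is one common textbook formulation rescaled to equal 1.0 at the pivot. The pivoted version is an assumption, not necessarily the function used in the experiments quoted above, and the class name and all parameter values are made up for illustration.

```java
// Sketch only: three length-normalization curves in plain Java (no Lucene calls).
// Class name, method names, and parameter values are hypothetical.
final class LengthNormSketch {

    // Default-style curve: 1/sqrt(numTerms); a 1-term field gets the highest norm.
    static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Plateau curve: 1.0 for lengths inside [min, max], falling off on both sides.
    static double sweetSpotNorm(int numTerms, int min, int max, double steepness) {
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    // Pivoted curve (assumed formulation), scaled so it equals 1.0 at the pivot:
    // lengths below the pivot get a norm above 1.0, lengths above it get less.
    static double pivotedNorm(int uniqueTerms, double pivot, double slope) {
        return pivot / ((1.0 - slope) * pivot + slope * uniqueTerms);
    }

    public static void main(String[] args) {
        int[] lengths = {1, 5, 20, 100};
        for (int n : lengths) {
            System.out.println(n + " terms: default=" + defaultNorm(n)
                    + " sweetSpot=" + sweetSpotNorm(n, 3, 10, 0.5)
                    + " pivoted=" + pivotedNorm(n, 20.0, 0.25));
        }
    }
}
```

Note that in this formulation the pivoted curve still decreases steadily with length, so a 1-term field gets the largest value of all, whereas the plateau curve penalizes it.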

: question is how to compute/store/retrieve this data.
: The way I experimented with it was not focused on efficiency
: but rather on flexibility at search time, my custom analyzer
: counted the number of unique tokens in the document, and finally
: a field was added to the document with this number. At search
: time this field was loaded (for all docs), the average was

One option to avoid that extra work at index building time would be to
use logic like what's in LengthNormModifier to build a cache when the
IndexReader is opened containing the number of terms (either unique or
total depending on whether you use +=freq or ++) in each doc per field.

it's really no different than a FieldCache -- except that the
FieldCache.getCustom API doesn't really give you the means to compute
arbitrary values, but the principle is the same.
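
A minimal sketch of the counting loop, assuming the postings for one field have already been read into a plain map (term -> docId -> freq). The map is a stand-in for walking the reader's terms and their docs. Nothing here is Lucene API, and the class and method names are hypothetical.

```java
import java.util.Map;

// Sketch only: per-document term counts for a single field, built from an
// in-memory model of postings (term -> docId -> freq). Names are hypothetical.
final class TermCountCacheSketch {

    // unique == true counts distinct terms per doc (the "++" variant);
    // unique == false counts total term occurrences (the "+= freq" variant).
    static int[] countTerms(Map<String, Map<Integer, Integer>> postings,
                            int maxDoc, boolean unique) {
        int[] counts = new int[maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            for (Map.Entry<Integer, Integer> entry : docs.entrySet()) {
                if (unique) {
                    counts[entry.getKey()]++;
                } else {
                    counts[entry.getKey()] += entry.getValue();
                }
            }
        }
        return counts;
    }

    // Average length over all docs, e.g. to use as the pivot or sweet spot.
    static double average(int[] counts) {
        if (counts.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return (double) sum / counts.length;
    }
}
```

The array would be computed once per opened reader and per field, then kept around the same way a FieldCache entry is.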

: natural way to do this is to have two fields "body" and
: "title", set their boosts 1 for "body" and 3 for "title",
: and then, when one searches the entire document (without
: specifying a field), create a multi field query. Things should
: work fine, - boosts are ok, tf() is by field, so is norm.
: But empirically it doesn't work well. When I modified

were the boosts you are referring to index time boosts or query time
boosts?  if they were index time (and you applied them to every document
since in theory the title of every document is worth 3 times as much as
the body of that document) then i think your index time boosts wound
up being a complete wash.

assuming you were using query time boosts, and assuming we accept your
premise that in a situation with docA having a 3 word title and a 27 word
body while docB has a 10 word title and a 20 word body, both docs should
score the same on a word that matches in the "body" of each, i would argue
that's something that can probably be best solved with Payload queries
where you put a payload on the "title" terms in your field that boost
their score -- not by using multiple fields.

In the absence of Payload queries you might be able to approximate the same
effects by letting your "body" field use length norm, and having an
isolated title field which OMIT_NORMs that you query with a higher boost
... in situations where the title is significantly shorter than the body
this should give you roughly the same behavior for the 'body' clause of
your query, but eliminate the effects really short or really long titles
have on your relevancy.
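
To make the title half of that concrete, here is a deliberately simplified sketch of a single title clause score: sqrt(tf) * boost * norm, with idf, coord, and queryNorm left out. It is a toy model rather than Lucene's scorer, and the names and the boost of 3 are only for illustration.

```java
// Sketch only: toy score for one title clause, ignoring idf, coord, and queryNorm.
// Class and method names are hypothetical.
final class TitleNormSketch {

    // Default-style length norm: 1/sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // With norms omitted the norm factor is fixed at 1.0, so the title length drops out.
    static double titleScore(int tf, int titleLength, double queryBoost, boolean omitNorms) {
        double norm = omitNorms ? 1.0 : lengthNorm(titleLength);
        return Math.sqrt(tf) * queryBoost * norm;
    }

    public static void main(String[] args) {
        // docA has a 3 word title, docB a 10 word title; the term occurs once in each.
        System.out.println("with norms:    A=" + titleScore(1, 3, 3.0, false)
                + " B=" + titleScore(1, 10, 3.0, false));
        System.out.println("norms omitted: A=" + titleScore(1, 3, 3.0, true)
                + " B=" + titleScore(1, 10, 3.0, true));
    }
}
```

With norms omitted, a single title match is worth the same in docA and docB, and only the query time boost separates title matches from body matches.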

: my index to have a single field in
: which I just multiplied the title 3 times, I got better results.
: containing all the text? Perhaps the loss of information in
: the norms was more damaging with more (smaller) fields? (I had
: four btw) - I don't know.  I just saw that a single field
: is better (when this is possible), and I went on with it.

bear in mind, you didn't just get lengthNorm differences from this -- you
also got tf/idf differences as well.

: Now, the new payloads allow to specify boost at token level. So


i think once the Payload stuff really gets shaken out, and it's possible
to set index time boosts on a per term basis, the meme of querying across
several different fields with different boosts to get better relevancy
scores will change drastically ... fields will make sense for truly
distinct pieces of information that you might want to query in isolation
(title, author, summary, body) but "tricks" for making certain parts of
larger text worth more by splitting it into separate fields artificially
will no longer be needed, and the tf/idf values for terms will be more
genuine than they are now.

: So theoretically I agree with you. In reality, I don't know if
: we often get to see examples as that last one. And, it would be

your friends/enemies example may be contrived, but the principle is still
true ... if you have an index about movies with fields for the title and
the plot synopsis, and the list of the full cast and crew, would it ever
make sense for the number of cast members (and thus the total number of
terms in the cast field) to influence how searches on the title or synopsis
field work out?

: This is nice. I didn't think of it. It may be nice if instead of
: creating new types of queries the existing ones (Span, Boolean,
: Phrase, Wild) could be somehow "set to" use DocRelativeTfTermQuery
: instead of TermQuery. ?

well ... it might be nice for you :) ... but there is a cross product
issue involved here of all the possible permutations.
DocRelativeTfSimilarity is just one possible type of similarity that people
might want, other people might want DocRelativeIdfSimilarity type
functionality, or DocRelativeCoordSimilarity ... or there might be
differing concepts that conflict (TermSetRelativeCoordSimilarity,
TermRelativeTfSimilarity, etc...) ... not to mention the possible performance
impacts of using these Query subclasses i proposed ... so how do you pick
which one the default "primitive" Query classes use?

the best bet is probably to leave the primitives primitive, and just make
sure any specialty subclasses we add to the core are easy to
substitute into the QueryParser for novice users.

