lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-3221) improve docvalues integration with scoring
Date Mon, 20 Jun 2011 14:15:47 GMT
improve docvalues integration with scoring
------------------------------------------

                 Key: LUCENE-3221
                 URL: https://issues.apache.org/jira/browse/LUCENE-3221
             Project: Lucene - Java
          Issue Type: New Feature
          Components: core/index
            Reporter: Robert Muir
             Fix For: flexscoring branch


Currently, the flexscoring branch is limited by the fact that you can index at most a single
byte per document for scoring within Similarity.

I added a simple test, showing how in your app itself you can index a per-document value (such
as a boost) and then use it in scoring: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/TestDocValuesScoring.java

However, I think we should generalize this mechanism (note: class names can be changed
to whatever makes sense).
In Similarity, instead of byte computeNorm(FieldInvertState), I think we should have void
computeNorm(StatsWriter, FieldInvertState).

Then a Similarity can ask the StatsWriter for instance(s), where an instance is something
like a (name, type, aggregates) triple.
Name would be a simple name like "boost" that the sim later uses to retrieve this docvalue.
Type would be something like int8/int32/varint/byte.
Aggregates could at first be a boolean or whatever; I think at first we should allow for the
sum to be written (e.g. to provide sum and average).
This would support aggregate statistics such as 'total number of tokens in index' and 'average
length'.
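
To make the (name, type, aggregates) idea concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: StatsSketch, Reference, Type, and the explicit docId argument are illustration-only names, not existing Lucene classes, and a real implementation would write to docvalues rather than to a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the proposal; none of these types exist in Lucene. */
final class StatsSketch {
  enum Type { INT8, INT32, VARINT, FLOAT32 }

  /** One (name, type, aggregates) instance; keeps a running sum when aggregates is true. */
  static final class Reference {
    final String name;
    final Type type;
    final boolean aggregates;
    private final Map<Integer, Long> perDoc = new LinkedHashMap<>();
    private long sum;

    Reference(String name, Type type, boolean aggregates) {
      this.name = name;
      this.type = type;
      this.aggregates = aggregates;
    }

    /** Records the per-document value; rewriting a document replaces its old value in the sum. */
    void write(int docId, long value) {
      Long previous = perDoc.put(docId, value);
      if (aggregates) {
        sum += value - (previous == null ? 0L : previous);
      }
    }

    long get(int docId) {
      return perDoc.get(docId);
    }

    long sum() {
      if (!aggregates) {
        throw new IllegalStateException("no aggregates recorded for " + name);
      }
      return sum;
    }

    double average() {
      return perDoc.isEmpty() ? 0.0 : (double) sum() / perDoc.size();
    }
  }

  /** Hands out one Reference per name, creating it on first request. */
  static final class StatsWriter {
    private final Map<String, Reference> refs = new LinkedHashMap<>();

    Reference getReference(String name, Type type, boolean aggregates) {
      return refs.computeIfAbsent(name, n -> new Reference(n, type, aggregates));
    }
  }

  public static void main(String[] args) {
    StatsWriter writer = new StatsWriter();
    Reference length = writer.getReference("length", Type.INT32, true);
    length.write(0, 3);
    length.write(1, 5);
    length.write(2, 10);
    System.out.println("total tokens = " + length.sum());
    System.out.println("average length = " + length.average());
  }
}
```

With only the sum stored, both 'total number of tokens in index' (the sum itself) and 'average length' (sum divided by the number of documents) fall out directly.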

An example of the new computeNorm (or whatever we call it) would be:
{noformat}
  void computeNorm(StatsWriter writer, FieldInvertState state) {
    writer.getReference("length", INT32, Aggregates.YES).write(state.numTokens);
    writer.getReference("boost", FLOAT32, Aggregates.NO).write(state.boost);
    ...
  }
{noformat}

I think the sim should be able to reference the docvalues field names it writes
with "relative" names like length and boost.
Whatever we do behind the scenes is an implementation detail.
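
As an illustration of why a sim would want both a per-document "length" and its aggregate, here is a rough sketch of the standard BM25 term-frequency normalization, which needs the document length and the average length. The class name and the k1/b defaults (1.2 and 0.75, the commonly used values) are assumptions for illustration; this is not the flexscoring branch's API.

```java
/** Hypothetical sketch; not Lucene's Similarity API. */
final class Bm25LengthSketch {
  static final double K1 = 1.2;  // term-frequency saturation (commonly used default)
  static final double B = 0.75;  // strength of length normalization (commonly used default)

  /**
   * BM25 term-frequency component: tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgLen)).
   * docLength would come from the per-document "length" value, avgLength from its aggregate.
   */
  static double tfNorm(double tf, long docLength, double avgLength) {
    double lengthRatio = docLength / avgLength;
    return tf * (K1 + 1) / (tf + K1 * (1 - B + B * lengthRatio));
  }

  public static void main(String[] args) {
    long totalTokens = 18;  // the written sum for "length"
    int docCount = 3;
    double avgLength = (double) totalTokens / docCount;
    for (long docLength : new long[] {3, 5, 10}) {
      System.out.println("length " + docLength + " -> " + tfNorm(1.0, docLength, avgLength));
    }
  }
}
```

Documents longer than average get a smaller contribution for the same term frequency, which is only possible if the real length is available rather than a single lossy byte.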

Also, for this to work, I think we need to add int8, int16, int32, ... types to docvalues,
and maybe we should add hasArray()/getArray(). I think the existing compressed INTS should be
kept, but maybe renamed to varint or something like that. This could still be useful: for
example, if someone wants "real document lengths" for bm25 but doesn't really need a full
32-bit range, they can make the tradeoff to use packed integers and load less into RAM...
but that should be the sim's choice to make.
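
The size of that tradeoff can be estimated with simple arithmetic. The sketch below is plain Java with hypothetical helper names (it does not call Lucene's packed-ints code) and ignores headers and alignment overhead.

```java
/** Hypothetical helper for estimating packed-integer storage; not a Lucene class. */
final class PackedSizeSketch {
  /** Bits needed to store any value in [0, maxValue]; at least 1. */
  static int bitsRequired(long maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be >= 0");
    }
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /** Bytes needed for docCount values at bitsPerValue bits each, rounded up. */
  static long packedBytes(long docCount, int bitsPerValue) {
    return (docCount * bitsPerValue + 7) / 8;
  }

  public static void main(String[] args) {
    long docCount = 1_000_000;
    int bits = bitsRequired(4095);  // longest document has 4095 tokens
    System.out.println("packed: " + packedBytes(docCount, bits) + " bytes");
    System.out.println("int32:  " + packedBytes(docCount, 32) + " bytes");
  }
}
```

For one million documents whose lengths never exceed 4095 tokens, 12 bits per value come to about 1.5 MB, against 4 MB for a full int32 per document.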


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

