lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: getting number of terms in a document/field
Date Fri, 06 Feb 2015 15:28:55 GMT
On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> Hi Michael,
>
> Thanks for the explanation. I am working with a TREC dataset;
> since it is static, I set the size of that array experimentally.
>
> I followed the DefaultSimilarity#lengthNorm method a bit.
>
> If the default similarity is used with no index-time boost,
> I assume the norm equals 1.0 / Math.sqrt(numTerms).
>
> The first option is to somehow obtain the pre-computed norm value and apply
> the reverse operation to obtain numTerms: numTerms = (1/norm)^2.
> This will be an approximation because norms are stored in a single byte.
> How do I access that norm value for a given docid and a field?

See the AtomicReader.getNormValues method.
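To see why the recovered numTerms is only approximate, here is a standalone sketch (not Lucene source, and the class name `NormRoundTrip` is made up for illustration) that re-implements the 3-mantissa-bit "SmallFloat" byte encoding DefaultSimilarity uses when storing norms, and then reverses it:

```java
// Standalone illustration: mirrors the logic of Lucene's
// SmallFloat.floatToByte315 / byte315ToFloat, which DefaultSimilarity
// uses to squeeze the norm into one byte. Treat this as a sketch of
// the idea, not the library's actual source.
public class NormRoundTrip {

    // Encode a float into one byte: 3 mantissa bits, zero exponent 15.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            // Too small to represent: clamp to 0 or the smallest value.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            // Too large to represent: clamp to the largest value.
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Decode the byte back into a float.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        int numTerms = 100;
        // DefaultSimilarity's lengthNorm (no boost): 1/sqrt(numTerms) = 0.1
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        // After the byte round trip the norm comes back as 0.09375 ...
        float decoded = byte315ToFloat(floatToByte315(norm));
        // ... so (1/norm)^2 recovers roughly 113.8, not 100.
        double recovered = Math.pow(1.0 / decoded, 2);
        System.out.println("true=" + numTerms + " recovered=" + recovered);
    }
}
```

So for a 100-term field, the quantized norm decodes to 0.09375 and the reverse operation yields about 113.8 terms, which is the scale of error to expect from the first option.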

> The second option: I store numTerms as a separate field, like any other regular field.
> Do I need to calculate it myself, or can I access the already-computed
> numTerms value during indexing?
>
> I think I will follow second option.
> Is there a pointer to an example of reading/writing a DocValues-based field?

You could just make your own Similarity impl that encodes the norm
directly as the length. The norm is a long, so you don't have to
compress it if you don't want to.

That custom Similarity is passed a FieldInvertState, which contains
the number of tokens in the current field, so you can just use that
instead of computing it yourself.
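The suggestion above might look roughly like the following pseudocode sketch (Lucene 4.x-era names assumed; not compile-checked, and the abstract scoring methods a real Similarity must implement are omitted, with the class name `TokenCountSimilarity` made up for illustration):

```
// Pseudocode sketch of a Similarity that stores the raw token count
// as the norm, uncompressed.
class TokenCountSimilarity extends Similarity {
    @Override
    long computeNorm(FieldInvertState state) {
        // getLength() is the number of tokens indexed for this field
        // in the current document.
        return state.getLength();
    }
    // ...a real implementation must also provide the scoring methods,
    // e.g. by delegating to DefaultSimilarity.
}
```

At read time the exact length could then come back through the getNormValues path mentioned above, with no lossy decoding step.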

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

