lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: getting number of terms in a document/field
Date Sat, 07 Feb 2015 12:57:59 GMT
Hi,

Sorry for my ignorance, how do I obtain an AtomicReader from an IndexReader?

I came up with the code below, but it gives me a list of atomic readers.

for (AtomicReaderContext context : reader.leaves()) {
    NumericDocValues docValues = context.reader().getNormValues(field);
    if (docValues != null) {
        normValue = docValues.get(docID);
    }
}
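
A note on the loop above: each leaf reader addresses documents by segment-local IDs, so a top-level docID has to be mapped to one leaf and rebased before the lookup. The sketch below shows only that arithmetic in plain Java, assuming every segment is non-empty. The docBases array and the LeafLookup class are hypothetical stand-ins; in Lucene 4.x the corresponding pieces are, to my understanding, ReaderUtil.subIndex(docID, reader.leaves()) and AtomicReaderContext.docBase.

```java
import java.util.Arrays;

public class LeafLookup {

    /**
     * Returns the index of the segment whose doc ID range contains globalDocId.
     * docBases holds the first top-level doc ID of each segment, in ascending order.
     */
    static int subIndex(int globalDocId, int[] docBases) {
        int pos = Arrays.binarySearch(docBases, globalDocId);
        // Exact hit: globalDocId is the first document of that segment.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owning
        // segment is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Converts a top-level doc ID into an ID relative to its own segment. */
    static int localDocId(int globalDocId, int[] docBases) {
        return globalDocId - docBases[subIndex(globalDocId, docBases)];
    }

    public static void main(String[] args) {
        // Hypothetical index with three segments of 100, 150 and 50 documents.
        int[] docBases = {0, 100, 250};
        int docID = 180;
        System.out.println("leaf=" + subIndex(docID, docBases)
                + " local=" + localDocId(docID, docBases));
    }
}
```

With a real reader, the norm lookup would then be made once, on the selected leaf only, using the local ID rather than the top-level docID.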

I implemented the custom similarity you advised by merging TFIDFSimilarity and DefaultSimilarity.
The computeNorm(FieldInvertState state) method is final in TFIDFSimilarity, so I couldn't
simply extend it.
I was able to retrieve those long values from a single-segment index, but I don't like this
solution, because I am experimenting with different similarity implementations.

It looks like there is no easy way to access FieldInvertState.getLength() and index
this value into an independent NumericDocValues field, say numTerms, other than norms.


I think I will compute the length of the fields myself.

Thanks,
Ahmet


On Friday, February 6, 2015 5:31 PM, Michael McCandless <lucene@mikemccandless.com> wrote:
On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> Hi Michael,
>
> Thanks for the explanation. I am working with a TREC dataset;
> since it is static, I set the size of that array experimentally.
>
> I followed the DefaultSimilarity#lengthNorm method a bit.
>
> If the default similarity and no index-time boost are used,
> I assume that the norm equals 1.0 / Math.sqrt(numTerms).
>
> The first option is to somehow obtain the pre-computed norm value and apply the
> reverse operation to obtain numTerms:
> numTerms = (1/norm)^2. This will be an approximation because norms are stored in a byte.
> How do I access that norm value for a given docid and field?

See the AtomicReader.getNormValues method.
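
To make the approximation concrete, here is a minimal sketch of the inversion numTerms = (1/norm)^2 before and after squeezing the norm into one byte. It assumes the byte encoding keeps three mantissa bits and five exponent bits, which is my understanding of what SmallFloat.floatToByte315 does for DefaultSimilarity in Lucene 4.x. The helpers encode315 and decode315 are illustrative re-implementations written for this example, not the library code, and the NormInversion class is hypothetical.

```java
public class NormInversion {

    // Illustrative one-byte float encoding: 3 mantissa bits, 5 exponent bits,
    // zero exponent point 15. An assumption modeled on Lucene's SmallFloat.
    static byte encode315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Too small to represent: zero or negative maps to 0, anything else to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: clamp to the largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Inverse of encode315, up to the precision that was dropped.
    static float decode315(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // numTerms = (1 / norm)^2, rounded to the nearest whole number of terms.
    static long recoverNumTerms(float norm) {
        return Math.round(1.0 / ((double) norm * norm));
    }

    public static void main(String[] args) {
        int numTerms = 100;
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        float stored = decode315(encode315(exact));
        System.out.println("from exact norm:  " + recoverNumTerms(exact));
        System.out.println("from stored norm: " + recoverNumTerms(stored));
    }
}
```

Under these assumptions, a field of 100 terms inverts back to 100 from the exact norm but to roughly 114 from the one-byte norm, which is the kind of error to expect from the first option.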

> The second option is to store numTerms as a separate field, like any other organic field.
> Do I need to calculate it myself, or can I access the already computed numTerms
> value during indexing?
>
> I think I will follow the second option.
> Is there a pointer to an example that demonstrates reading/writing a DocValues-based field?

You could just make your own Similarity impl, that encodes the norm
directly as a length?  It's a long so you don't have to compress if
you don't want to.

That custom Similarity is passed FieldInvertState which contains the
number of tokens in the current field, so you can just use that
instead of computing it yourself.


Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


 

