lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: getting number of terms in a document/field
Date Fri, 06 Feb 2015 13:51:59 GMT
Hi Michael,

Thanks for the explanation. I am working with a TREC dataset; 
since it is static, I set the size of that array experimentally. 
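
If the collection were not static, one way to avoid sizing the array by experiment
would be to key the counts by frequency in a map. Below is a minimal sketch in plain Java.
The class and method names are made up for illustration, and the int[] stands in for
the values returned by docsEnum.freq():

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: histogram of within-document term frequencies. */
public class FreqHistogram {

    /** Counts how many documents have each frequency; no upper bound is needed. */
    public static Map<Integer, Integer> histogram(int[] freqs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int freq : freqs) {
            // Insert 1 for a new frequency, otherwise add 1 to the existing count.
            counts.merge(freq, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in values for docsEnum.freq(); one of them is deliberately huge.
        int[] freqs = {1, 3, 1, 250000};
        System.out.println(histogram(freqs));
    }
}
```

A sorted map costs more per update than an array, but it only holds the frequencies that actually occur.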

I followed the DefaultSimilarity#lengthNorm method a bit.

If the default similarity is used and no index-time boost is applied, 
I assume that the norm equals 1.0 / Math.sqrt(numTerms).

The first option is to somehow obtain the pre-computed norm value and apply the reverse
operation to obtain numTerms:
numTerms = (1/norm)^2
This will be an approximation because norms are stored in a byte.
How do I access that norm value for a given docid and field?
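
To get a feel for how rough that approximation is, here is a small plain-Java sketch
of the round trip. The encode/decode methods are modeled on my reading of Lucene's
SmallFloat.floatToByte315 and byte315ToFloat (3 mantissa bits). Treat that as an
assumption and check it against the Lucene version in use. The class and method names
are hypothetical:

```java
/** Hypothetical sketch: invert norm = 1/sqrt(numTerms) after one-byte quantization. */
public class NormInversion {

    /** Assumed to mirror SmallFloat.floatToByte315: keep 3 mantissa bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Underflow: zero/negative maps to 0, tiny positive values map to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    /** Assumed to mirror SmallFloat.byte315ToFloat. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** numTerms = (1/norm)^2, rounded to the nearest integer. */
    static long approxNumTerms(float norm) {
        return Math.round(1.0 / ((double) norm * norm));
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {16, 100, 1000}) {
            float norm = (float) (1.0 / Math.sqrt(numTerms));
            float stored = decode(encode(norm)); // what survives the byte
            System.out.println(numTerms + " terms -> recovered " + approxNumTerms(stored));
        }
    }
}
```

If the encoding assumption holds, the recovered value is better read as a length bucket than as an exact count.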

The second option is to store numTerms as a separate field, like any other field.
Do I need to calculate it myself, or can I access the numTerms value already computed
above during indexing? 

I think I will follow the second option.
Is there a pointer to an example that demonstrates reading and writing a DocValues-based field?

Thanks,
Ahmet


On Friday, February 6, 2015 11:08 AM, Michael McCandless <lucene@mikemccandless.com>
wrote:
How will you know how large to allocate that array?  The within-doc
term freq can in general be arbitrarily large...

Lucene does not directly store the total number of terms in a
document, but it does store it approximately in the doc's norm value.
Maybe you can use that?  Alternatively, you can store this statistic
yourself, e.g. as a doc value.

Mike McCandless

http://blog.mikemccandless.com



On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> Hello Lucene Users,
>
> I am traversing all documents that contain a given term with the following code:
>
> Term term = new Term(field, word);
> Bits bits = MultiFields.getLiveDocs(reader);
> DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes());
>
> while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {
>
> array[docsEnum.freq()]++;
>
> // how to retrieve term count for this document?
>    xxxxx(docsEnum.docID(), field);
>
>
> }
>
> How can I get field term count values for these documents using Lucene 4.10.3?
>
> Is the above code OK for traversing the postings list of a term?
>
> Thanks,
> Ahmet
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
