lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Iterating TermsEnum for Long field produces zero values at the end
Date Tue, 18 Nov 2014 17:15:49 GMT
FieldCache is (will be?) already gone in 5.0: it's moved to the "misc"
module.  It is slow the first time you use it since it must walk all
postings doing the inversion.  It is also a heap hog compared to doc
values which get more dev attention and try to be more careful in how
they spend heap.

If you are accessing doc values in a custom collector or custom sort,
it's best to use the LeafReader.  But if e.g. you are just pulling a
handful of values for the current page after search is done, then
MultiDocValues is OK (it's slower per-lookup than using the leaf's doc
values directly).

Mike McCandless

http://blog.mikemccandless.com
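
[Editor's sketch] The per-lookup cost mentioned above can be pictured as follows: a top-level view first has to map a global doc id to its segment and only then read the segment-local value, while code that already runs per leaf skips that step. The plain-Java sketch below uses no Lucene classes; the names `LeafLookupSketch`, `leafIndex` and `lookup` are hypothetical, and the binary search over segment doc bases is an assumption about how such a mapping is typically done, not a copy of Lucene's implementation.

```java
import java.util.Arrays;

public class LeafLookupSketch {

    // docBases[i] is the first global doc id of segment i.
    // Assumption: strictly ascending, docBases[0] == 0, no empty segments.
    public static int leafIndex(int globalDoc, int[] docBases) {
        int pos = Arrays.binarySearch(docBases, globalDoc);
        // Exact hit: globalDoc is the first doc of that segment.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // owning segment is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Top-level lookup: resolve the segment, then read the local value.
    public static long lookup(int globalDoc, int[] docBases, long[][] perLeafValues) {
        int leaf = leafIndex(globalDoc, docBases);
        int localDoc = globalDoc - docBases[leaf];
        return perLeafValues[leaf][localDoc];
    }

    public static void main(String[] args) {
        int[] docBases = {0, 3, 7};          // three segments: 3, 4 and 2 docs
        long[][] perLeafValues = {
            {10, 11, 12},
            {20, 21, 22, 23},
            {30, 31}
        };
        // Global doc 5 lives in segment 1 at local doc 2.
        System.out.println(lookup(5, docBases, perLeafValues));
    }
}
```

For a handful of hits on one result page the extra search per lookup is negligible; inside a collector that visits every matching document it is paid once per document, which is why reading from the leaf directly is preferable there.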


On Tue, Nov 18, 2014 at 7:56 AM, Barry Coughlan <b.coughlan2@gmail.com> wrote:
> Never mind, I got it: MultiDocValues.getNumericValues(final IndexReader r,
> final String field)
>
> Barry
>
> On Tue, Nov 18, 2014 at 12:05 PM, Barry Coughlan <b.coughlan2@gmail.com>
> wrote:
>
>> Hi Michael,
>>
>> Indexing:
>>
>>     private NumericDocValuesField idField = new NumericDocValuesField("id", 0);
>>
>> Reading:
>>
>>     private NumericDocValues cacheDocIds() throws IOException {
>>         AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
>>         return DocValues.getNumeric(wrapped, "id");
>>     }
>>
>>
>> I'm just putting this here for others because it's hard to find up-to-date
>> examples of using DocValues.
>>
>> Two quick questions:
>>
>> 1. Do you suggest I use DocValues because it is intended to eventually
>> replace FieldCache?
>> 2. Is it preferable to use reader.leaves() instead of
>> SlowCompositeReaderWrapper here and somehow merge the segments?
>>
>> Thanks for all your help.
>>
>> Barry
>>
>>
>>
>>
>> On Mon, Nov 17, 2014 at 8:37 PM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>>> It's better to use doc values than field cache, if you can.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Nov 17, 2014 at 2:55 PM, Barry Coughlan <b.coughlan2@gmail.com>
>>> wrote:
>>> > Makes sense, thanks. I switched the implementation to a FieldCache
>>> > with no noticeable performance difference:
>>> >
>>> > private Longs cacheDocIds() throws IOException {
>>> >     AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
>>> >     Longs vals = FieldCache.DEFAULT.getLongs(wrapped, "id", false);
>>> >     return vals;
>>> > }
>>> >
>>> > Regards,
>>> > Barry
>>> >
>>> > On Mon, Nov 17, 2014 at 6:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> > It is expected: those are the "prefix" terms, which come after all
>>> >> > the full-precision numeric terms.
>>> >> >
>>> >> > But I'm not sure why you see 0s ... the bytes should be unique for
>>> >> > every term you get back from the TermsEnum.
>>> >>
>>> >> That's easy to explain:
>>> >>
>>> >> The lower precision terms at the end have more than one doc in the
>>> >> DocsEnum, you always return only the first (Lucene docid 0, you never
>>> >> list all other entries in DocsEnum). The prefixcoded term has a shift
>>> >> value > 0 and because bits are stripped from the right, the small long
>>> >> values will therefore return 0L after decoding.
>>> >>
>>> >> In general to have such a type of cache, I would not use terms and
>>> >> instead use numeric docvalues. An alternative is to use FieldCache,
>>> >> which does the right thing automatically. Relying on the internal
>>> >> implementation of numeric terms is not a good idea.
>>> >>
>>> >> Uwe
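
[Editor's sketch] The explanation above can be reproduced without Lucene. The plain-Java sketch below mimics the idea of a prefix-coded long: the value is made sortable by flipping its sign bit, the low `shift` bits are dropped, and decoding shifts back and undoes the flip. The class and method names are hypothetical, and the precision step of 16 is an assumption about the default used for long fields in the Lucene versions discussed here; it is not taken from the thread.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class PrefixTermSketch {

    // Assumed precision step for 64-bit fields (an assumption, see lead-in).
    public static final int PRECISION_STEP = 16;

    private static final long SIGN_BIT = 0x8000000000000000L;

    // Keep only the high (64 - shift) bits of the sortable form of value.
    public static long encode(long value, int shift) {
        return (value ^ SIGN_BIT) >>> shift;
    }

    // Shift back and undo the sign flip; the dropped low bits come back as zeros.
    public static long decode(long prefix, int shift) {
        return (prefix << shift) ^ SIGN_BIT;
    }

    // Distinct decoded values that a set of ids produces at one shift level.
    public static Set<Long> decodedTerms(long[] ids, int shift) {
        Set<Long> distinct = new LinkedHashSet<>();
        for (long id : ids) {
            distinct.add(decode(encode(id, shift), shift));
        }
        return distinct;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 4, 5};
        for (int shift = 0; shift < 64; shift += PRECISION_STEP) {
            System.out.println("shift=" + shift + " decoded=" + decodedTerms(ids, shift));
        }
    }
}
```

Under that assumed step there are three coarser levels (shifts 16, 32 and 48). At each of them the ids 1 to 5 collapse into a single term that decodes to 0, which would be consistent with the three trailing "0 0" lines in the original question: one extra term per level, each reporting the first doc of its postings.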
>>> >>
>>> >> > On Mon, Nov 17, 2014 at 10:39 AM, Barry Coughlan
>>> >> > <b.coughlan2@gmail.com> wrote:
>>> >> > > Hi all,
>>> >> > >
>>> >> > > I'm using 4.10.2. I have a Long "id" field. Each document has one
>>> >> > > "id" value. I am creating a look-up between Lucene's internal
>>> >> > > document id and my "id" values by enumerating the inverted index:
>>> >> > >
>>> >> > >     private long[] cacheDocIds() throws IOException {
>>> >> > >         long[] ourIds = new long[reader.maxDoc()];
>>> >> > >
>>> >> > >         Bits liveDocs = MultiFields.getLiveDocs(reader);
>>> >> > >         Fields fields = MultiFields.getFields(reader);
>>> >> > >         Terms terms = fields.terms("id");
>>> >> > >
>>> >> > >         TermsEnum iterator = terms.iterator(null);
>>> >> > >         BytesRef bytesRef = null;
>>> >> > >         while ((bytesRef = iterator.next()) != null) {
>>> >> > >             DocsEnum docsEnum = iterator.docs(liveDocs, null,
>>> >> > > DocsEnum.FLAG_NONE);
>>> >> > >
>>> >> > >             int luceneId = docsEnum.nextDoc();
>>> >> > >             long ourId = NumericUtils.prefixCodedToLong(bytesRef);
>>> >> > >             System.out.println(luceneId + " " + ourId);
>>> >> > >             ourIds[luceneId] = ourId;
>>> >> > >         }
>>> >> > >
>>> >> > >         return ourIds;
>>> >> > >     }
>>> >> > >
>>> >> > > With 5 documents (1, 2, 3, 4, 5) I get this output from the above
>>> >> > > code:
>>> >> > >
>>> >> > > 0 1
>>> >> > > 1 2
>>> >> > > 2 3
>>> >> > > 3 4
>>> >> > > 4 5
>>> >> > > 0 0
>>> >> > > 0 0
>>> >> > > 0 0
>>> >> > >
>>> >> > > I don't understand why there are three zeroes at the end.
>>> >> > >
>>> >> > > - reader.maxDoc is 5 and no documents have been deleted.
>>> >> > > - I have tried this with a varying number of documents and there
>>> >> > > are always three zeroes at the end.
>>> >> > > - I tried changing version to Lucene 4.10.0 and Lucene 4.9 and the
>>> >> > > same behavior occurs.
>>> >> > >
>>> >> > > I can work around this, but I'm just curious whether this behavior
>>> >> > > is expected?
>>> >> > >
>>> >> > > Regards,
>>> >> > > Barry
>>> >> >
>>> >> > ---------------------------------------------------------------------
>>> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
