lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Mapping doc values back to doc ID (in decent time)
Date Fri, 07 Aug 2015 06:30:19 GMT
Hi all.

It's that time again.

I'm trying to kill off our long-standing reliance on stable doc IDs.
To that end, I am adding an additional field which contains the ID.
But we use these IDs a lot and for all kinds of purposes, and in some
of these purposes, many lookups are done at once, so performance
starts to matter.

A good example of this is running a query against some external
source to get matching IDs, which then have to be converted to a
DocIdSet so they can be used as a Filter.

For doc ID -> our ID, it's simple.

    NumericDocValues values = MultiDocValues.getNumericValues(reader, "doc-id");
    for (int docId = 0; docId < count; docId++)
    {
       // get() returns a long; there is only ever one value per doc
       long ourId = values.get(docId);
    }

This is pretty quick: 106.7 ms for 10 million lookups.

For our ID -> doc ID, I can't figure out how to get decent speed.
IndexSearcher is phenomenally slow (as expected), so I tried working
with Terms and PostingsEnum directly:

    Terms terms = MultiFields.getTerms(reader, "doc-id");
    assertNotNull(terms);
    BytesRefBuilder builder = new BytesRefBuilder();
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postingsEnum = null;

    for (int ourId = 0; ourId < count; ourId++)
    {
        builder.clear();
        NumericUtils.longToPrefixCoded(ourId, 0, builder);
        // postings() is only valid if the seek actually found the term
        assertTrue(termsEnum.seekExact(builder.get()));
        postingsEnum = termsEnum.postings(null, postingsEnum);
        int docId = postingsEnum.nextDoc(); // only ever one value
    }

This is slow - 38 seconds for the same items. Of course, I tried using
MemoryPostingsFormat to speed this up. It did help a little, bringing
the time down to around 10 seconds.

But that is still slow. The SQL query returning the same items,
iterating the result set to pull the results out and stuffing them
into the bit set takes 260 ms. I'd rather not map them to actual doc
IDs if it's going to make queries an order of magnitude slower.

If I look at what Lucene is doing, most of the time seems to go here:

    at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:1097)
    at org.apache.lucene.util.fst.FST.findTargetArc(FST.java:1271)
    at org.apache.lucene.util.fst.FST.findTargetArc(FST.java:1195)
    at org.apache.lucene.util.fst.FSTEnum.doSeekExact(FSTEnum.java:441)
    at org.apache.lucene.util.fst.BytesRefFSTEnum.seekExact(BytesRefFSTEnum.java:84)
    at org.apache.lucene.codecs.memory.MemoryPostingsFormat$FSTTermsEnum.seekExact(MemoryPostingsFormat.java:800)
    at org.apache.lucene.index.MultiTermsEnum.seekExact(MultiTermsEnum.java:159)

The next approach on my list is the sledgehammer - walk the entire
DocValues, build the reverse mapping myself and stuff a cache of that
somewhere tied to the LeafReader. Maybe try to implement PostingsEnum
and Terms for that and then put it on our reader which wraps the real
one.
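
As a rough sketch of what that sledgehammer might look like, assuming
our IDs are dense, non-negative and fit in an int: the long[] below is
only a stand-in for the values read per doc ID from NumericDocValues,
and ReverseIdMap and its methods are hypothetical names, not Lucene
API. The cache tied to the LeafReader is not shown.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical helper: reverse map from our ID to doc ID. */
final class ReverseIdMap {
    // Index = our ID, value = doc ID, or -1 if no doc has that ID.
    private final int[] docIdByOurId;

    /**
     * @param valuesByDocId valuesByDocId[docId] is our ID for that doc;
     *                      a stand-in for NumericDocValues.get(docId)
     */
    ReverseIdMap(long[] valuesByDocId) {
        long max = -1;
        for (long value : valuesByDocId) {
            // Assumption: IDs are non-negative and small enough to index an array.
            if (value < 0 || value >= Integer.MAX_VALUE) {
                throw new IllegalArgumentException("ID out of range: " + value);
            }
            max = Math.max(max, value);
        }
        docIdByOurId = new int[(int) max + 1];
        Arrays.fill(docIdByOurId, -1);
        // One pass over every doc, recording where each ID lives.
        for (int docId = 0; docId < valuesByDocId.length; docId++) {
            docIdByOurId[(int) valuesByDocId[docId]] = docId;
        }
    }

    /** Returns the doc ID for our ID, or -1 if it is unknown. */
    int docId(long ourId) {
        if (ourId < 0 || ourId >= docIdByOurId.length) {
            return -1;
        }
        return docIdByOurId[(int) ourId];
    }

    /** Converts a batch of our IDs into a bit set of doc IDs, skipping unknown IDs. */
    BitSet toDocIdBits(long[] ourIds) {
        BitSet bits = new BitSet();
        for (long ourId : ourIds) {
            int docId = docId(ourId);
            if (docId >= 0) {
                bits.set(docId);
            }
        }
        return bits;
    }
}
```

Each lookup is then a single array read, at the cost of one full pass
over the values and an int per possible ID. If the IDs turned out to be
sparse, a hash map would be the obvious substitute for the array.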

Before I resort to that, though, I want to check whether Lucene has a
better way to do this than what I can find by reading the docs.
I had a look at SortedNumericDocValues, but it appears to sort the
values within each document rather than across documents, so it
doesn't look like something I could use for this.

TX
