lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <>
Subject [jira] [Created] (SOLR-6888) Find a way to avoid decompressing entire blocks for just the docId/uniqueKey
Date Thu, 25 Dec 2014 01:33:13 GMT
Erick Erickson created SOLR-6888:

             Summary: Find a way to avoid decompressing entire blocks for just the docId/uniqueKey
                 Key: SOLR-6888
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 5.0, Trunk
            Reporter: Erick Erickson
            Assignee: Erick Erickson

Assigning this to myself just so I don't lose track of it, but I won't be working on it in the
near term; anyone feeling ambitious should feel free to grab it.

Note: docId as used here means whatever field is defined as <uniqueKey>...

Since Solr 4.1, the compression/decompression process for stored fields has been based on 16K
blocks and is automatic, not configurable. So, to get a single stored value, one must decompress
at least one entire 16K block.
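The cost asymmetry is easy to model outside Lucene. Here is a minimal Python sketch, not Solr code; the packing scheme and document sizes are simplified assumptions, with zlib standing in for Lucene's LZ4:

```python
import zlib

BLOCK_SIZE = 16 * 1024  # 16K, the stored-field block size since Solr 4.1

# Pack many small documents into compressed blocks, mimicking how stored
# fields for multiple docs share one compression block.
docs = [(f"doc{i:04d}" + ",some stored field data " * 20).encode()
        for i in range(1000)]

blocks = []          # compressed blocks
doc_locations = []   # docNum -> (blockIndex, offset, length)
buf = b""
for d in docs:
    if len(buf) + len(d) > BLOCK_SIZE:
        blocks.append(zlib.compress(buf))
        buf = b""
    doc_locations.append((len(blocks), len(buf), len(d)))
    buf += d
blocks.append(zlib.compress(buf))

def stored_value(doc_num):
    """Fetching one doc's stored data forces inflating its whole block."""
    block_idx, off, length = doc_locations[doc_num]
    block = zlib.decompress(blocks[block_idx])  # entire block decompressed
    return block[off:off + length]

# Even to read just the uniqueKey of doc 500, the full block is unpacked.
print(stored_value(500)[:7].decode())  # doc0500
```

The point of the sketch: `stored_value` cannot inflate less than a whole block, so a request that only needs the uniqueKey still pays for every other document and field packed alongside it.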

For SolrCloud (and distributed processing in general), we make two trips: one to get the docId
and score (or other sort criteria), and one to return the actual data.

The first pass requires that each sub-request return its top N docIDs and sort criteria, which
means unpacking at least one 16K block (and sometimes more) per returned document just to get
the doc ID. So if we have 20 shards and only want 20 rows, 95% of the decompression
cycles will be wasted. Not to mention all the disk reads.
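The arithmetic behind the 95% figure, as a quick check (assuming one block decompression per candidate docID, the best case):

```python
shards = 20
rows = 20  # top N requested by the client

# First pass: every shard must return its own top `rows` candidates,
# each costing at least one block decompression just for the docID.
candidate_decompressions = shards * rows   # 400

# Only `rows` of those candidates survive the merge and are fetched for real.
wasted = candidate_decompressions - rows   # 380
wasted_fraction = wasted / candidate_decompressions

print(f"{wasted_fraction:.0%} of first-pass decompressions wasted")  # 95%
```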

It seems like we should be able to do better than that. Can we argue that doc ids are 'special'
and should be cached somehow? Let's discuss what this would look like. I can think of a couple
of approaches:

1> Since doc IDs are "special", can we say that for this purpose returning the indexed
version is OK? We'd need to return the actual stored value when the full doc was requested,
but for the sub-request alone, what about returning the indexed value instead of the stored
one? On the surface I don't see a problem here, but what do I know? Storing these as DocValues
seems useful in this case.

1a> A variant is treating numeric docIds specially, since the indexed value and the stored
value should be identical. DocValues would seem useful here too. But this seems an
unnecessary specialization if <1> is implemented well.

2> We could cache individual doc IDs, although I'm not sure what use that really is. Would
maintaining the cache overwhelm the savings from not decompressing? I really don't like this
idea, but am throwing it out there. Populating the cache from stored data up front would
essentially mean decompressing every doc, which seems untenable.

3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily initializing
it. I'm not particularly a fan of this either; it doesn't seem like a Good Thing. I can see lazy
loading being almost, but not quite totally, useless, i.e. a hit ratio near 0, especially
since it'd be thrown out on every openSearcher.
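Option <3> amounts to something like the following lazily-filled array. This is a hypothetical sketch, not Solr code; `fetch_stored_id` stands in for the block-decompressing stored-field read:

```python
class LazyIdCache:
    """array[maxDoc] of uniqueKeys, filled on first access (option <3>).

    The whole array would be discarded on every openSearcher, which is
    why the expected hit ratio is so poor.
    """
    def __init__(self, max_doc, fetch_stored_id):
        self.ids = [None] * max_doc      # one slot per document
        self.fetch = fetch_stored_id     # expensive: decompresses a block
        self.hits = 0
        self.misses = 0

    def get(self, doc_num):
        if self.ids[doc_num] is None:
            self.misses += 1             # the decompression is paid anyway
            self.ids[doc_num] = self.fetch(doc_num)
        else:
            self.hits += 1
        return self.ids[doc_num]
```

Note that the first access to any doc still decompresses its block, so this only helps when the same docs keep landing in top-N results between searcher reopens.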

Really, the only one of these that seems viable is <1>/<1a>. The others would
all involve decompressing the docs anyway to get the ID, and I suspect that caching would
be of very limited usefulness. I guess <1>'s viability hinges on whether, for internal
use, the indexed form of DocId is interchangeable with the stored value.
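As a data-structure sketch, <1> boils down to keeping the uniqueKey in a per-document column (which is what DocValues provide), written once at index time, so the first pass never touches stored fields. A hypothetical Python model, assuming the indexed key can stand in for the stored one (the open question above):

```python
# Per-document column of indexed keys, written once at index time.
# Names and layout here are illustrative, not Solr internals.
max_doc = 1000
id_column = [f"doc{i:04d}" for i in range(max_doc)]

def first_pass_entry(doc_num, score):
    """Build a (uniqueKey, score) pair for the merge without reading
    stored fields, hence without decompressing any 16K block."""
    return (id_column[doc_num], score)

def second_pass_docs(doc_nums, fetch_stored_doc):
    """Only the merged top-N docs pay the stored-field decompression cost."""
    return [fetch_stored_doc(n) for n in doc_nums]
```

Under this split, the wasted-decompression problem disappears from the first pass entirely; the second pass fetches exactly the rows the client asked for.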

Or are there other ways to approach this? Or isn't it something to really worry about?

This message was sent by Atlassian JIRA
