lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <>
Subject [jira] [Updated] (LUCENE-7563) BKD index should compress unused leading bytes
Date Mon, 05 Dec 2016 11:09:59 GMT


Adrien Grand updated LUCENE-7563:
    Attachment: LUCENE-7563-prefixlen-unary.patch

The change looks good and the drop is quite spectacular.
:-) I think there is just a redundant arraycopy in {{clone()}}?

For the record, I played with another idea that leverages the fact that the prefix lengths on
two consecutive levels are likely to be close to each other, and that the most common values for
the deltas are 0, then 1, then -1. So we might be able to save more by encoding the delta
between consecutive prefix lengths using unary coding on top of zig-zag encoding, which would
allow encoding 0 on 1 bit, 1 on 2 bits, 2 on 3 bits, etc. However, it only saved 1% of memory
on IndexOSM and less than 1% on IndexTaxis. I'm attaching it here in case someone wants to have
a look, but I don't think the gains are worth the complexity.
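
To make the idea concrete, here is a minimal sketch of unary coding on top of zig-zag
encoding, applied to deltas between consecutive prefix lengths. It is not the code from the
attached patch: the class name, the method names and the use of java.util.BitSet are
assumptions made for illustration only. It uses the standard zig-zag mapping (0 -> 0,
-1 -> 1, 1 -> 2, ...), so a delta of -1 costs 2 bits and a delta of +1 costs 3 bits; the
patch may order the values differently.

```java
import java.util.BitSet;

// Hypothetical illustration only; names and layout are not taken from the patch.
class UnaryZigZagSketch {

  // Standard zig-zag: maps signed ints to non-negative ints (0->0, -1->1, 1->2, -2->3, ...).
  public static int zigZagEncode(int v) {
    return (v << 1) ^ (v >> 31);
  }

  public static int zigZagDecode(int z) {
    return (z >>> 1) ^ -(z & 1);
  }

  // Writes each prefix length as unary(zigzag(delta from the previous length)).
  // Unary coding of z is z one-bits followed by a single zero-bit, i.e. z + 1 bits.
  // Returns the total number of bits written.
  public static int encode(int[] prefixLengths, BitSet out) {
    int pos = 0;
    int prev = 0; // assumption: the first length is encoded as a delta from 0
    for (int len : prefixLengths) {
      int z = zigZagEncode(len - prev);
      out.set(pos, pos + z); // z one-bits (no-op when z == 0)
      pos += z;
      out.clear(pos++);      // terminating zero-bit
      prev = len;
    }
    return pos;
  }

  // Reads back 'count' prefix lengths written by encode().
  public static int[] decode(BitSet in, int count) {
    int[] result = new int[count];
    int pos = 0;
    int prev = 0;
    for (int i = 0; i < count; i++) {
      int z = 0;
      while (in.get(pos++)) { // count one-bits until the terminating zero-bit
        z++;
      }
      prev += zigZagDecode(z);
      result[i] = prev;
    }
    return result;
  }
}
```

With this sketch, the prefix lengths {3, 3, 4, 3, 3} take 7 + 1 + 3 + 2 + 1 = 14 bits: the
first value is a delta of 3 from the assumed starting point 0, and each later delta of 0
takes a single bit.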

> BKD index should compress unused leading bytes
> ----------------------------------------------
>                 Key: LUCENE-7563
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: master (7.0), 6.4
>         Attachments: LUCENE-7563-prefixlen-unary.patch, LUCENE-7563.patch, LUCENE-7563.patch, LUCENE-7563.patch, LUCENE-7563.patch
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per dimension,
> but if e.g. you are indexing {{LongPoint}} yet only use the bottom two bytes in a given segment,
> we shouldn't store all those leading 0s in the index.

This message was sent by Atlassian JIRA
