lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
Date Fri, 09 Dec 2016 16:57:58 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7589:
---------------------------------
    Attachment: LUCENE-7589.patch

Here is a patch. The doc values consumer computes the space usage for two cases: all values
using the same number of bits per value, and values split into blocks of 16384 values. If
using blocks saves 10% of disk usage or more, it encodes each block with its own required
number of bits per value.
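
To make the comparison concrete, here is a rough sketch of how such a decision could be computed. It is not the code from the attached patch: the class name, the method names and the per-block overhead of 72 bits are assumptions made for illustration, and only the block size (16384) and the 10% threshold come from the description above.

```java
/** Hypothetical sketch of the space comparison; not the code in LUCENE-7589.patch. */
final class BlockBitsEstimate {
    static final int BLOCK_SIZE = 16384;           // block size quoted in the message
    static final int PER_BLOCK_OVERHEAD_BITS = 72; // assumption: 64-bit min + 8-bit width per block

    /** Bits needed to store (value - min) for any value in [min, max]. */
    static int bitsRequired(long min, long max) {
        long delta = max - min; // a negative result means overflow, i.e. all 64 bits are needed
        return delta == 0 ? 0 : 64 - Long.numberOfLeadingZeros(delta);
    }

    /** Total bits to encode values[start, end) with one shared width. */
    static long bitsForRange(long[] values, int start, int end) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (int i = start; i < end; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return (long) (end - start) * bitsRequired(min, max);
    }

    /** Case 1: the whole segment shares a single number of bits per value. */
    static long singleWidthBits(long[] values) {
        return bitsForRange(values, 0, values.length);
    }

    /** Case 2: every block of blockSize values gets its own width, plus assumed overhead. */
    static long blockedBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            total += bitsForRange(values, start, end) + PER_BLOCK_OVERHEAD_BITS;
        }
        return total;
    }

    /** Blocks are chosen only if they save 10% or more: blocked <= 0.9 * single. */
    static boolean useBlocks(long[] values, int blockSize) {
        return blockedBits(values, blockSize) * 10 <= singleWidthBits(values) * 9;
    }

    public static void main(String[] args) {
        long[] values = new long[4 * BLOCK_SIZE];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16;  // small values that fit in 4 bits
        }
        values[0] = 1L << 40;    // a single outlier in the first block
        System.out.println("single width: " + singleWidthBits(values) + " bits");
        System.out.println("blocked:      " + blockedBits(values, BLOCK_SIZE) + " bits");
        System.out.println("use blocks:   " + useBlocks(values, BLOCK_SIZE));
    }
}
```

In this toy input the outlier only inflates the width of the block that contains it, so the blocked estimate comes out at roughly a third of the single-width one. Without the outlier, the per-block overhead makes the blocked layout slightly larger and the single width is kept.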

I kept the block size rather high, since this impl can only jump forward {{blockSize}}
documents at a time, so a high value like 16384 hopefully keeps performance good. In the
future we might want to look into leveraging the sequential access pattern even more (to do
run-length encoding, for instance) and maybe have e.g. a skip list to handle the big jumps,
like postings do. I think this patch is a good first (baby) step in that direction.
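
To illustrate why the block size matters for jumps, here is a minimal sketch of a forward-only cursor that can only hop one block at a time. The layout it assumes (the encoded byte length of each block is known up front) and all names in it are hypothetical, not taken from the patch.

```java
/** Hypothetical forward-only cursor over variable-width blocks; not the patch's reader. */
final class ForwardBlockCursor {
    private final int blockSize;
    private final long[] blockLengths; // assumption: encoded byte length of each block
    private int block = 0;             // index of the current block
    private long blockStart = 0;       // byte offset where the current block begins
    private int hops = 0;              // number of blocks stepped over so far

    ForwardBlockCursor(int blockSize, long[] blockLengths) {
        this.blockSize = blockSize;
        this.blockLengths = blockLengths;
    }

    /** Moves to the block containing targetDoc and returns that block's byte offset. */
    long advance(int targetDoc) {
        int targetBlock = targetDoc / blockSize;
        if (targetBlock < block) {
            throw new IllegalArgumentException("cursor is forward-only");
        }
        if (targetBlock >= blockLengths.length) {
            throw new IllegalArgumentException("targetDoc is past the last block");
        }
        // Blocks have different sizes, so the offset cannot be computed directly:
        // every skipped block costs one step.
        while (block < targetBlock) {
            blockStart += blockLengths[block];
            block++;
            hops++;
        }
        return blockStart;
    }

    int hops() {
        return hops;
    }

    public static void main(String[] args) {
        ForwardBlockCursor cursor =
            new ForwardBlockCursor(16384, new long[] {100, 200, 300, 400});
        System.out.println("offset: " + cursor.advance(40000) + ", hops: " + cursor.hops());
    }
}
```

With this kind of layout the cost of a long jump grows with the number of blocks skipped, which is why a large block size keeps the number of hops low and why a skip list would help with big jumps.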

> Prevent outliers from raising the number of bits of everyone with numeric doc values
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7589
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7589.patch
>
>
> Today we encode entire segments with a single number of bits per value. It was done this
> way because it was faster, but it also means a single outlier can significantly increase the
> space requirements. I think we should have protection against that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


