lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
Date Fri, 01 Mar 2019 10:14:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781536#comment-16781536
] 

Alan Woodward commented on LUCENE-8585:
---------------------------------------

Adrian and I had a look at this, and we think it's a test bug - BaseNormsFormatTestCase specifies
NoMergePolicy, and this particular run of the test was using the HandleLimitFS which prevents
more than 2048 file handles being opened, which doesn't sit well with lots of small segments
being generated and never merged.  There's no particular reason for it to not merge things,
though.

We should also change TestLucene70NormsFormat to test the correct Codec.  I'll close this
and open a new issue for these two fixes.

> Create jump-tables for DocValues at index-time
> ----------------------------------------------
>
>                 Key: LUCENE-8585
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8585
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: performance
>             Fix For: 8.0
>
>         Attachments: LUCENE-8585.patch, LUCENE-8585.patch, make_patch_lucene8585.sh
>
>          Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid long iterative
walks. This is implemented in LUCENE-8374 at search-time (first request for DocValues from
a field in a segment), with the benefit of working without changes to existing Lucene 7 indexes
and the downside of introducing a startup time penalty and a memory overhead.
> As discussed in LUCENE-8374, the codec should be updated to create these jump-tables
at index time. This eliminates the segment-open time & memory penalties, with the potential
downside of increasing index-time for DocValues.
> The three elements of LUCENE-8374 should be transferable to index-time without much alteration
of the core structures:
>  * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for every 65536
documents, containing the offset of the block in 33 bits and the index (number of set bits)
up to the block in 31 bits.
>  It can be build sequentially and should be stored as a simple sequence of consecutive
longs for caching of lookups.
>  As it is fairly small, relative to document count, it might be better to simply memory
cache it?
>  * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 bits)
for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. Each {{short}} represents
the number of set bits up to right before the corresponding sub-block of 512 docIDs.
>  The \{{shorts}} can be computed sequentially or when the DENSE block is flushed (probably
the easiest). They should be stored as a simple sequence of consecutive shorts for caching
of lookups, one logically independent sequence for each DENSE block. The logical position
would be one sequence at the start of every DENSE block.
>  Whether it is best to read all the 16 {{shorts}} up front when a DENSE block is accessed
or whether it is best to only read any individual {{short}} when needed is not clear at this
point.
>  * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric values. Each
{{long}} holds the offset to the corresponding block of values.
>  The offsets can be computed sequentially and should be stored as a simple sequence of
consecutive {{longs}} for caching of lookups.
>  The vBPV-offsets has the largest space overhead og the 3 jump-tables and a lot of the
64 bits in each long are not used for most indexes. They could be represented as a simple
{{PackedInts}} sequence or {{MonotonicLongValues}}, with the downsides of a potential lookup-time
overhead and the need for doing the compression after all offsets has been determined.
> I have no experience with the codec-parts responsible for creating index-structures.
I'm quite willing to take a stab at this, although I probably won't do much about it before
January 2019. Should anyone else wish to adopt this JIRA-issue or co-work on it, I'll be happy
to share.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message