lucene-dev mailing list archives

From "Eks Dev (JIRA)" <>
Subject [jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
Date Mon, 20 Oct 2008 19:29:44 GMT


Eks Dev commented on LUCENE-1426:

Just a few random thoughts on this topic

- I am sure I read somewhere in the PDFs that were floating around that it makes sense
to use VInts for very short postings and PFOR for the rest. I just do not remember the rationale
behind it.

- During the omitTf() discussion, we came up with the cool idea of inlining very short postings
into the term dict instead of storing an offset. This way we spare one seek per term in many cases,
as well as some space for storing the offset. I do not know if this is a problem, but it sounds reasonable.
With a standard Zipfian distribution, a lot of postings should get inlined. Use cases where
we have query expansion on many terms (think spell checker, synonyms ...) should benefit
heavily from that. These postings are small but there are a lot of them, so it adds up... seek is
deadly :)
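
To make the inlining idea a bit more concrete, here is a rough Java sketch. It is not taken from any patch: the class and method names (InlinePostingsSketch, TermEntry, makeEntry) and the cutoff of 4 docs are all made up for illustration. Short postings are delta-coded as VInts and stored directly in the term dict entry; longer ones keep only an offset into the postings file, where something like PFOR could be used (not shown here).

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch; none of these names are Lucene APIs. */
final class InlinePostingsSketch {
    /** Assumed cutoff: postings with at most this many docs get inlined. */
    static final int INLINE_MAX_DOCS = 4;

    /** A term dict entry holding either inlined postings or a file offset. */
    static final class TermEntry {
        final String term;
        final byte[] inlined;  // non-null when the postings live in the term dict
        final long frqOffset;  // meaningful only when inlined == null

        TermEntry(String term, byte[] inlined, long frqOffset) {
            this.term = term;
            this.inlined = inlined;
            this.frqOffset = frqOffset;
        }

        /** True when reading the postings requires a seek into the postings file. */
        boolean needsSeek() {
            return inlined == null;
        }
    }

    /** Writes a non-negative int as a VInt: 7 bits per byte, high bit means "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Delta-codes sorted doc IDs and writes each gap as a VInt. */
    static byte[] encodeDocs(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            writeVInt(out, doc - prev);
            prev = doc;
        }
        return out.toByteArray();
    }

    /** Decodes docCount VInt gaps back into absolute doc IDs. */
    static int[] decodeDocs(byte[] bytes, int docCount) {
        int[] docs = new int[docCount];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < docCount; i++) {
            int b = bytes[pos++];
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
            }
            prev += value;
            docs[i] = prev;
        }
        return docs;
    }

    /** Inlines short postings; otherwise records only the offset into the postings file. */
    static TermEntry makeEntry(String term, int[] sortedDocIds, long frqOffset) {
        if (sortedDocIds.length <= INLINE_MAX_DOCS) {
            return new TermEntry(term, encodeDocs(sortedDocIds), -1L);
        }
        return new TermEntry(term, null, frqOffset);
    }
}
```

With a Zipfian term distribution most terms would take the first branch of makeEntry, so a query expanded to many rare terms could read their postings straight out of the term dict without touching the postings file.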

I am sorry to miss the party here with PFOR, but let us hope this credit crunch is over
soon so that I can dedicate some time to fun things like this :)

cheers, eks 


> Next steps towards flexible indexing
> ------------------------------------
>                 Key: LUCENE-1426
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1426.patch
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
> But it quickly became difficult.  EG we currently mux the skip data
> into the .frq file, which messes up the int blocks.  We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format.  The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.
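
As a rough illustration of the modularization described in the issue, the sketch below separates terms, docs/freqs, and positions/payloads into their own interfaces and drives them from a single write routine that both a flush and a merge could call. All names here (TermsSink, DocsSink, PositionsSink, RecordingSink, SegmentWriterSketch) are hypothetical and are not the interfaces in LUCENE-1426.patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: receives positions (and optional payloads) for one doc. */
interface PositionsSink {
    void addPosition(int position, byte[] payload);
}

/** Hypothetical: receives docs and term frequencies for one term. */
interface DocsSink {
    PositionsSink addDoc(int docId, int termFreq);
}

/** Hypothetical: receives terms; each term hands back a sink for its docs. */
interface TermsSink {
    DocsSink startTerm(String term);
    void finishTerm(String term, int docCount);
}

/** Toy implementation that just records the calls it sees. */
final class RecordingSink implements TermsSink, DocsSink, PositionsSink {
    final List<String> events = new ArrayList<>();

    public DocsSink startTerm(String term) {
        events.add("term " + term);
        return this;
    }

    public void finishTerm(String term, int docCount) {
        events.add("end " + term + " docs=" + docCount);
    }

    public PositionsSink addDoc(int docId, int termFreq) {
        events.add("doc " + docId + " tf=" + termFreq);
        return this;
    }

    public void addPosition(int position, byte[] payload) {
        events.add("pos " + position);
    }
}

final class SegmentWriterSketch {
    /** One write path; a flush and a merge would both feed postings through here. */
    static void writeTerm(TermsSink sink, String term, int[] docIds, int[][] positions) {
        DocsSink docs = sink.startTerm(term);
        for (int i = 0; i < docIds.length; i++) {
            PositionsSink pos = docs.addDoc(docIds[i], positions[i].length);
            for (int p : positions[i]) {
                pos.addPosition(p, null); // no payload in this toy example
            }
        }
        sink.finishTerm(term, docIds.length);
    }
}
```

Because the caller only sees the three interfaces, a different encoding (VInt, PFOR, inlined postings) would be a matter of supplying another implementation rather than changing the flush or merge code.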
