lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
Date Wed, 22 Oct 2008 09:20:44 GMT


Michael McCandless commented on LUCENE-1426:

bq. TermDocs could have a list of Attributes that the posting list offers.

I like this approach.

Though unlike LUCENE-1422, where Token remains separate from
TokenStream (and I'm still not sure it should be...?), I think for
TermDocs there would not be the analog of a separate Token.
Ie, it would look something like this:

  myPerDocAttr = termDocs.getAttribute(MyPerDoc.class);

  while (termDocs.next()) {
    x = myPerDocAttr.getValue(...);
  }
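
To make the shape of that loop concrete, here is a minimal,
self-contained sketch. Everything in it is hypothetical: ToyTermDocs
is an in-memory stand-in for TermDocs, and PostingAttribute, MyPerDoc
and getAttribute() are illustrative names, not existing Lucene API.
The point is only that the attribute instance is fetched once and then
refreshed in place on each next().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical marker for a per-document attribute; not a Lucene type.
interface PostingAttribute {}

// Hypothetical attribute exposing one int value per document.
final class MyPerDoc implements PostingAttribute {
    private int value;
    void setValue(int v) { value = v; }
    int getValue() { return value; }
}

// Toy stand-in for TermDocs, backed by in-memory arrays.
final class ToyTermDocs {
    private final int[] docs;
    private final int[] perDocValues;
    private final MyPerDoc perDoc = new MyPerDoc();
    private final Map<Class<? extends PostingAttribute>, PostingAttribute> attrs =
        new HashMap<>();
    private int pos = -1;

    ToyTermDocs(int[] docs, int[] perDocValues) {
        this.docs = docs;
        this.perDocValues = perDocValues;
        attrs.put(MyPerDoc.class, perDoc);
    }

    // Returns the single shared attribute instance of the requested type.
    <T extends PostingAttribute> T getAttribute(Class<T> type) {
        PostingAttribute a = attrs.get(type);
        if (a == null) {
            throw new IllegalArgumentException("no such attribute: " + type.getName());
        }
        return type.cast(a);
    }

    // Advances to the next document and refreshes the attribute in place.
    boolean next() {
        if (pos + 1 >= docs.length) {
            return false;
        }
        pos++;
        perDoc.setValue(perDocValues[pos]);
        return true;
    }

    int doc() { return docs[pos]; }
}
```

Because the caller holds on to the attribute object, no separate
per-document object (the analog of Token) is handed out by next().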

However, this form of flexibility is actually beyond what I'm aiming
for, for the first step of reader flexibility (there are so many
facets of "flexible indexing"!).

For starters I'd like to allow flexibility on how you encode the
existing postings (doc/freq/positions/payloads).  The attribute
approach, by contrast, is about extending what is actually stored in
and read from the index.  I think we should do both, but my focus now
is on the first one, specifically being able to drop in a codec that
uses pulsing, a less RAM-intensive terms dict index, and/or PFOR, etc.
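
As a rough illustration of what "dropping in a codec" could mean for
the existing postings, the sketch below hides the encoding of doc ID
deltas behind one interface with two interchangeable implementations.
The DocIdCodec interface and both class names are assumptions made up
for this example, not the API in the patch. The variable-length one
follows the usual vInt layout (7 data bits per byte, high bit set when
more bytes follow); the fixed-width one is only a simple stand-in for
an alternative encoding and is not PFOR.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec interface: encodes ascending doc IDs as deltas.
interface DocIdCodec {
    byte[] encode(int[] sortedDocIds);
    int[] decode(byte[] data, int count);
}

// Delta + vInt: 7 data bits per byte, high bit marks a continuation.
final class VIntDeltaCodec implements DocIdCodec {
    public byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int delta = doc - prev;
            prev = doc;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    public int[] decode(byte[] data, int count) {
        int[] docs = new int[count];
        int p = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int b = data[p++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = data[p++];
                delta |= (b & 0x7F) << shift;
            }
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}

// Delta + fixed 4-byte big-endian ints; a stand-in alternative, not PFOR.
final class FixedIntDeltaCodec implements DocIdCodec {
    public byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int delta = doc - prev;
            prev = doc;
            out.write(delta >>> 24);
            out.write(delta >>> 16);
            out.write(delta >>> 8);
            out.write(delta);
        }
        return out.toByteArray();
    }

    public int[] decode(byte[] data, int count) {
        int[] docs = new int[count];
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int p = i * 4;
            int delta = ((data[p] & 0xFF) << 24) | ((data[p + 1] & 0xFF) << 16)
                      | ((data[p + 2] & 0xFF) << 8) | (data[p + 3] & 0xFF);
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

Code that reads or writes postings only sees DocIdCodec, so swapping
the encoding does not touch the callers.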

> Next steps towards flexible indexing
> ------------------------------------
>                 Key: LUCENE-1426
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1426.patch
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
>
> But it quickly became difficult.  EG we currently mux the skip data
> into the .frq file, which messes up the int blocks.  We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
>
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format.  The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.
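
One way to picture the modularization described in the quoted issue is
sketched below: terms, docs/freqs and positions each get their own
small writer interface, and both flushing a new segment and merging
existing segments drive the same interfaces. All names here
(TermsWriter, DocsWriter, PositionsWriter, SegmentWriters) are
hypothetical and chosen for illustration; they are not taken from the
attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer interfaces, one per part of the postings.
interface PositionsWriter { void addPosition(int position); }
interface DocsWriter { PositionsWriter startDoc(int docId, int freq); }
interface TermsWriter { DocsWriter startTerm(String term); void finishTerm(); }

// Toy implementation that records calls instead of writing files.
final class RecordingTermsWriter implements TermsWriter, DocsWriter, PositionsWriter {
    final List<String> events = new ArrayList<>();
    public DocsWriter startTerm(String term) { events.add("term:" + term); return this; }
    public void finishTerm() { events.add("endTerm"); }
    public PositionsWriter startDoc(int docId, int freq) {
        events.add("doc:" + docId + "/" + freq);
        return this;
    }
    public void addPosition(int position) { events.add("pos:" + position); }
}

final class SegmentWriters {
    // Writes the docs and positions of one term, shifting doc IDs by docBase.
    private static void writeDocs(DocsWriter docs, int[] docIds, int[][] positions, int docBase) {
        for (int i = 0; i < docIds.length; i++) {
            PositionsWriter pos = docs.startDoc(docIds[i] + docBase, positions[i].length);
            for (int p : positions[i]) {
                pos.addPosition(p);
            }
        }
    }

    // Flushing a new segment: in-memory postings go through the writer API.
    static void flush(TermsWriter out, String term, int[] docIds, int[][] positions) {
        writeDocs(out.startTerm(term), docIds, positions, 0);
        out.finishTerm();
    }

    // Merging two segments: the same API, with the second segment rebased.
    static void merge(TermsWriter out, String term,
                      int[] docsA, int[][] posA,
                      int[] docsB, int[][] posB, int docBaseB) {
        DocsWriter docs = out.startTerm(term);
        writeDocs(docs, docsA, posA, 0);
        writeDocs(docs, docsB, posB, docBaseB);
        out.finishTerm();
    }
}
```

With this split, a different on-disk format would only need new
implementations of the three interfaces, while flush and merge stay
unchanged.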

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
