lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
Date Mon, 03 Jan 2011 19:20:47 GMT


Michael McCandless commented on LUCENE-2843:

bq. Just curious, how would the 'let FST decide' work?

The FST builder is able to prune as it builds.  E.g., it prunes a node if the number of unique
terms going through it is less than N.  Alternatively, it prunes if the node just before had
< N terms coming through.  To do this we'd pass all terms to the builder, and specify the
prune threshold.  So the FST would be "bushy"/deep where terms are dense, and shallowish
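
The first pruning rule (drop a node when fewer than N unique terms pass through it) can be pictured with a plain trie instead of a real FST. The sketch below is illustrative only: `PruneSketch`, `Node` and `keptPrefixes` are hypothetical names, not Lucene APIs, it assumes the input terms are unique, and it builds the whole trie before pruning, whereas the builder described above prunes while it builds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PruneSketch {
    /** Hypothetical trie node; not a Lucene class. */
    static final class Node {
        int termCount; // number of terms whose path passes through this node
        final Map<Character, Node> children = new TreeMap<>();
    }

    /** Builds a character trie and counts the terms passing through each node. */
    static Node build(List<String> uniqueTerms) {
        Node root = new Node();
        for (String term : uniqueTerms) {
            Node node = root;
            node.termCount++;
            for (char c : term.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
                node.termCount++;
            }
        }
        return root;
    }

    /** Collects, in sorted depth-first order, the prefixes whose node survives pruning. */
    static void collect(Node node, String prefix, int minTerms, List<String> out) {
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Node child = e.getValue();
            if (child.termCount < minTerms) {
                continue; // pruned, together with everything below it
            }
            String childPrefix = prefix + e.getKey();
            out.add(childPrefix);
            collect(child, childPrefix, minTerms, out);
        }
    }

    /** Returns the prefixes kept when nodes with fewer than minTerms terms are pruned. */
    public static List<String> keptPrefixes(List<String> uniqueTerms, int minTerms) {
        List<String> out = new ArrayList<>();
        collect(build(uniqueTerms), "", minTerms, out);
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("car", "card", "care", "cart", "dog");
        System.out.println(keptPrefixes(terms, 2)); // prints [c, ca, car]
    }
}
```

With a threshold of 2, the dense "car*" region keeps a three-level path while the lone "dog" is pruned away entirely, which is the deep-where-dense effect described above.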

bq. Isn't the resulting FST size also dependent upon the output value (terms dict file pointer)? And if we optimize
this locally (X versus Y) does it tend to hold globally?

Yes, very much so -- the more stuff you store in the output the bigger the FST.  But we only
store the long file pointer into the main terms dict for this usage, and the FST is efficient
(delta-codes the long values).  But, I'm not trying in any way to minimize that net size (in

> Add variable-gap terms index impl.
> ----------------------------------
>                 Key: LUCENE-2843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2843.patch
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
>
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
>
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
>
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
>
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
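
For readers unfamiliar with the terms index, here is a minimal sketch of selecting indexed terms and then looking one up with floor versus ceiling semantics. It is illustrative only: a `TreeMap` stands in for the FST, `TermSelector`, `everyNth`, `buildIndex`, `seekFloor` and `seekCeil` are hypothetical names rather than the patch's API, and the file pointers are made-up values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermsIndexSketch {
    /** Hypothetical stand-in for a pluggable selector: decides which terms are indexed. */
    public interface TermSelector {
        boolean isIndexTerm(String term, int position);
    }

    /** Fixed-gap policy: index every n-th term, starting with the first. */
    public static TermSelector everyNth(int n) {
        return (term, position) -> position % n == 0;
    }

    /** Maps each selected term to its (made-up) file pointer in the main terms dict. */
    public static TreeMap<String, Long> buildIndex(
            List<String> sortedTerms, long[] filePointers, TermSelector selector) {
        TreeMap<String, Long> index = new TreeMap<>();
        for (int i = 0; i < sortedTerms.size(); i++) {
            if (selector.isIndexTerm(sortedTerms.get(i), i)) {
                index.put(sortedTerms.get(i), filePointers[i]);
            }
        }
        return index;
    }

    /** Largest indexed term <= target, or null if there is none. */
    public static Map.Entry<String, Long> seekFloor(TreeMap<String, Long> index, String target) {
        return index.floorEntry(target);
    }

    /** Smallest indexed term >= target, or null if there is none. */
    public static Map.Entry<String, Long> seekCeil(TreeMap<String, Long> index, String target) {
        return index.ceilingEntry(target);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry", "date", "elder", "fig");
        long[] filePointers = {0L, 10L, 20L, 30L, 40L, 50L};

        // Fixed gap: every 2nd term is indexed (apple, cherry, elder).
        TreeMap<String, Long> fixed = buildIndex(terms, filePointers, everyNth(2));
        // Variable gap: any rule may be plugged in; here, an arbitrary length-based one.
        TreeMap<String, Long> variable =
                buildIndex(terms, filePointers, (term, position) -> term.length() <= 4);

        System.out.println(seekFloor(fixed, "date"));    // indexed term at or before "date"
        System.out.println(seekCeil(fixed, "date"));     // indexed term at or after "date"
        System.out.println(seekFloor(variable, "elder"));
    }
}
```

Floor is the natural lookup for a sparse index: the target may not be indexed, so the reader needs the closest indexed term at or before it and then scans forward in the main terms dict from that file pointer. When every term is present, as in the SimpleText case mentioned above, ceiling lands on the target itself or the next term after it.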

