lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
Date Mon, 03 Jan 2011 11:58:46 GMT


Michael McCandless commented on LUCENE-2843:

As a first test, I just made a policy that's identical to the fixed
gap terms index, ie, it just picks every 32nd term as the index term.
So this is really a test of the packed int/bytes vs FST.
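
A rough sketch of what such an "every Nth term" policy could look like. The TermSelector interface below is a hypothetical stand-in, not the actual IndexTermSelector signature from the patch, and treating the first term as an index term is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the selector hook; NOT the real Lucene interface.
interface TermSelector {
    // Called once per term, in sorted order; returns true to index the term.
    boolean isIndexTerm(String term);
}

// Picks the first term and every Nth term after it (assumed behavior).
final class EveryNthTermSelector implements TermSelector {
    private final int interval;
    private int count;

    EveryNthTermSelector(int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.interval = interval;
    }

    @Override
    public boolean isIndexTerm(String term) {
        boolean pick = (count % interval) == 0;
        count++;
        return pick;
    }
}

final class EveryNthDemo {
    // Runs the selector over already-sorted terms and collects the picks.
    static List<String> select(List<String> sortedTerms, TermSelector selector) {
        List<String> picked = new ArrayList<>();
        for (String t : sortedTerms) {
            if (selector.isIndexTerm(t)) {
                picked.add(t);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            terms.add(String.format("term%03d", i));
        }
        // prints [term000, term032, term064, term096]
        System.out.println(select(terms, new EveryNthTermSelector(32)));
    }
}
```

With a selector like this the set of indexed terms matches the fixed-gap case, so any difference comes from how the indexed terms are stored.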

On the 10M Wikipedia test index, the resulting terms index files (=
RAM used by SegmentReader) are ~38% smaller (~52% once optimized -- FST
"scales up" well).

Here's the query perf vs trunk:

||Query||QPS base||QPS vargap||Pct diff||
|spanFirst(unit, 5)|17.13|16.75|{color:red}-2.2%{color}|
|"unit state"~3|5.31|5.20|{color:red}-2.1%{color}|
|spanNear([unit, state], 10, true)|4.59|4.52|{color:red}-1.4%{color}|
|"unit state"|7.86|7.77|{color:red}-1.1%{color}|
|+nebraska +state|204.74|202.85|{color:red}-0.9%{color}|
|+unit +state|11.37|11.30|{color:red}-0.6%{color}|
|doctimesecnum:[10000 TO 60000]|9.74|9.76|{color:green}0.2%{color}|
|unit state|10.73|10.93|{color:green}1.9%{color}|

It's great that for the seek intensive fuzzy queries, the FST-based
seeking is substantially faster.  For other queries the term seek time
is in the noise.

I think we should make this (VariableGapTermsIndex) terms index impl
the default (for Standard codec).

> Add variable-gap terms index impl.
> ----------------------------------
>                 Key: LUCENE-2843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2843.patch
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
>
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
>
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though it does not support ord(), so it's not quite a fair
> comparison.
>
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
>
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
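
To illustrate the seekFloor vs seekCeil distinction from the description, here is a small sketch that uses java.util.TreeMap in place of an FST. This is not the FSTEnum API; the class name, method names, and the sample terms and offsets are all made up for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

final class SeekDemo {
    // Terms index use: find the last indexed term <= target. Its value
    // (here an assumed file offset) is where a scan of the full terms
    // dictionary would start. Returns null if target sorts before all
    // indexed terms.
    static Long seekFloor(TreeMap<String, Long> index, String target) {
        Map.Entry<String, Long> e = index.floorEntry(target);
        return e == null ? null : e.getValue();
    }

    // All-terms use: find the smallest term >= target, i.e. the term an
    // enum would be positioned on after seeking. Returns null if target
    // sorts after all terms.
    static String seekCeil(TreeMap<String, Long> terms, String target) {
        return terms.ceilingKey(target);
    }

    public static void main(String[] args) {
        // Sparse index: only some terms are present, mapped to offsets.
        TreeMap<String, Long> index = new TreeMap<>();
        index.put("apple", 0L);
        index.put("mango", 4096L);
        index.put("tiger", 8192L);

        // "nebraska" falls between "mango" and "tiger".
        System.out.println(seekFloor(index, "nebraska")); // prints 4096
        System.out.println(seekCeil(index, "nebraska"));  // prints tiger
    }
}
```

With a sparse index, the floor entry is the one that is useful: the target term, if it exists, sits at or after the preceding indexed term. When every term is present, the ceiling is the natural result of a seek.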

