From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
Date Tue, 16 Jul 2013 15:56:49 GMT


Michael McCandless commented on LUCENE-3069:

bq. However, seekExact(BytesRef, TermState) simply 'copies' the value of termState to the enum,
which doesn't actually perform a 'seek' on the dictionary.

This is normal and by design: it keeps the case of seekExact(TermState) followed by .docs
or .docsAndPositions fast.  We only need to re-load the metadata if the caller then calls
.next().
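
To make the copy-only seek concrete, here is a minimal sketch in plain Java. It is not
Lucene's TermsEnum: LazySeekSketch, LazyEnum, CachedState and the dictSeeks counter are
hypothetical names invented for this illustration, and a TreeMap stands in for the term
dictionary. The only point is that seekExact with a cached state does no dictionary work,
and the real seek is paid for only if next() is called afterwards.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration; not Lucene's TermsEnum or TermState.
class LazySeekSketch {

    /** Metadata a caller cached from an earlier lookup. */
    static final class CachedState {
        final int docFreq;
        CachedState(int docFreq) { this.docFreq = docFreq; }
    }

    static final class LazyEnum {
        private final NavigableMap<String, CachedState> dict; // sorted "term dictionary"
        private Iterator<Map.Entry<String, CachedState>> cursor; // null = not positioned
        private String term;
        private CachedState state;
        int dictSeeks = 0; // counts real dictionary seeks

        LazyEnum(NavigableMap<String, CachedState> dict) { this.dict = dict; }

        /** Copy-only seek: trusts the caller's state and never touches the dictionary. */
        void seekExact(String term, CachedState cached) {
            this.term = term;
            this.state = cached;
            this.cursor = null; // remember that we are not positioned
        }

        /** Served from the copied state, so it is cheap after seekExact. */
        int docFreq() { return state.docFreq; }

        /** Moves to the following term; the deferred seek happens here if needed. */
        String next() {
            if (cursor == null) {
                dictSeeks++;
                NavigableMap<String, CachedState> rest =
                    (term == null) ? dict : dict.tailMap(term, false);
                cursor = rest.entrySet().iterator();
            }
            if (!cursor.hasNext()) {
                term = null;
                state = null;
                return null;
            }
            Map.Entry<String, CachedState> e = cursor.next();
            term = e.getKey();
            state = e.getValue();
            return term;
        }
    }

    public static void main(String[] args) {
        TreeMap<String, CachedState> dict = new TreeMap<>();
        dict.put("apple", new CachedState(3));
        dict.put("banana", new CachedState(5));
        dict.put("cherry", new CachedState(2));

        LazyEnum te = new LazyEnum(dict);
        te.seekExact("banana", new CachedState(5));
        System.out.println(te.docFreq() + " seeks=" + te.dictSeeks); // 5 seeks=0
        System.out.println(te.next() + " seeks=" + te.dictSeeks);    // cherry seeks=1
    }
}
```

The trade-off in this sketch is that the enum trusts the caller's state to match the term;
nothing is verified against the dictionary until next() forces a real seek.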

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that,
when 'term()' is called, it will always return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair == null; is that safe?

We can't guarantee that, but I think we can just check if pair == null and return null from

By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they contain the same
metadata + stats, it seems to be very rare that their arcs are identical in the FST (size
increases by less than 1MB for wikimedium1m if equals always returns false for non-singleton
arguments). Therefore... yes, hashCode() isn't necessary here.
Hmm, but it seems like we should implement it?  I.e., we do get a smaller FST when implementing
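
As a rough illustration of why equals() and hashCode() matter for sharing, the sketch below
uses a toy hash-based dedup table, assuming that identical arcs can only be merged when their
outputs compare equal by value. It is not Lucene's FST builder: OutputSharingSketch, Meta,
IdentityMeta and distinctArcs are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Toy model for illustration; not Lucene's FST or Outputs classes.
class OutputSharingSketch {

    /** Hypothetical per-term output compared by value. */
    static final class Meta {
        final int docFreq;
        final long totalTermFreq;
        Meta(int docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Meta)) return false;
            Meta other = (Meta) o;
            return docFreq == other.docFreq && totalTermFreq == other.totalTermFreq;
        }
        @Override public int hashCode() {
            return Objects.hash(docFreq, totalTermFreq);
        }
    }

    /** Same fields, but keeps Object's identity-based equals() and hashCode(). */
    static final class IdentityMeta {
        final int docFreq;
        final long totalTermFreq;
        IdentityMeta(int docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    /** Counts the (label, output) arcs that a hash-based dedup table would keep. */
    static int distinctArcs(char[] labels, Object[] outputs) {
        Set<List<Object>> seen = new HashSet<>();
        for (int i = 0; i < labels.length; i++) {
            // List equality and hashing delegate to the elements' equals()/hashCode().
            seen.add(Arrays.<Object>asList(labels[i], outputs[i]));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        char[] labels = {'s', 's'};
        Object[] byValue = {new Meta(2, 7L), new Meta(2, 7L)};
        Object[] byIdentity = {new IdentityMeta(2, 7L), new IdentityMeta(2, 7L)};
        System.out.println(distinctArcs(labels, byValue));    // 1: arcs are shared
        System.out.println(distinctArcs(labels, byIdentity)); // 2: no sharing
    }
}
```

How much this saves on real data depends on how often two terms carry identical metadata and
stats, which the numbers quoted above suggest is rare.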
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>                 Key: LUCENE-3069
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, LUCENE-3069.patch,
> FST based TermDictionary has been a great improvement yet it still uses a delta codec
> file for scanning to terms. Some environments have enough memory available to keep the entire
> FST based term dict in memory. We should add a TermDictionary implementation that encodes
> all needed information for each term into the FST (custom fst.Output) and builds an FST from
> the entire term not just the delta.
