lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
Date Mon, 15 Jul 2013 12:54:49 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708428#comment-13708428 ]

Michael McCandless commented on LUCENE-3069:
--------------------------------------------

{quote}
Another thing that surprised me is, with the same code/conf, 
luceneutil creates different sizes of index? I tested 
that df==0 trick several times on wikimedium1m, the 
index size varies from 514M~522M... Will multi-threading affects
much here?
{quote}

Using threads means the docs are assigned to different segments on each run. It's
interesting that this can cause such variance in the index size, though.

It is known that ordering docs can improve compression, e.g. sorting docs by web site
(if you are indexing content from different sites); maybe that's the effect we're seeing here?
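
To make the ordering effect concrete, here is a minimal sketch (not Lucene's actual postings
format, and not luceneutil code). It builds a synthetic corpus in which each "site" has its
own vocabulary, then counts the bytes needed to store every term's doc IDs as delta-encoded
variable-length integers. The corpus shape, the names (make_docs, postings_bytes, vint_len)
and all parameters are assumptions chosen purely for illustration.

```python
import random


def vint_len(n):
    """Bytes needed to store n as a variable-length int (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def postings_bytes(docs):
    """Total bytes to delta-encode the doc IDs of every term.

    docs is a list of term sets; a doc's ID is its position in the list.
    """
    postings = {}
    for doc_id, terms in enumerate(docs):
        for term in terms:
            postings.setdefault(term, []).append(doc_id)
    total = 0
    for doc_ids in postings.values():
        prev = 0
        for doc_id in doc_ids:
            total += vint_len(doc_id - prev)  # store the gap, not the absolute ID
            prev = doc_id
    return total


def make_docs(num_sites=20, docs_per_site=500, terms_per_doc=10, seed=42):
    """Hypothetical corpus: every site draws terms from its own 50-term vocabulary."""
    rng = random.Random(seed)
    docs = []
    for site in range(num_sites):
        vocab = [f"site{site}_term{i}" for i in range(50)]
        for _ in range(docs_per_site):
            docs.append((site, frozenset(rng.sample(vocab, terms_per_doc))))
    return docs


docs = make_docs()

# Same documents, two different orderings.
by_site = [terms for _, terms in sorted(docs, key=lambda d: d[0])]
shuffled = docs[:]
random.Random(7).shuffle(shuffled)
interleaved = [terms for _, terms in shuffled]

sorted_size = postings_bytes(by_site)
shuffled_size = postings_bytes(interleaved)
print("sorted by site:", sorted_size, "bytes")
print("shuffled:      ", shuffled_size, "bytes")
```

When docs from one site are adjacent, the gaps between a term's doc IDs stay small and fit
in a single byte; when the same docs are interleaved, more gaps spill into a second byte.
A threaded indexing run changes which docs end up next to each other in each segment, so
some size variance of this kind is plausible, though this toy model does not show how much
of the 514M~522M spread it explains.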
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch
>
>
> FST-based TermDictionary has been a great improvement, yet it still uses a delta codec
> file for scanning to terms. Some environments have enough memory available to keep the entire
> FST-based term dict in memory. We should add a TermDictionary implementation that encodes
> all needed information for each term into the FST (custom fst.Output) and builds an FST from
> the entire term, not just the delta.


