lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8084) FSTs can be very space-inefficient on array-expanded nodes
Date Wed, 13 Dec 2017 09:28:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288954#comment-16288954
] 

Dawid Weiss commented on LUCENE-8084:
-------------------------------------

Hi Mike! I did see the todos in the logic of whether to expand a node or not, but I honestly
didn't have an idea what would be a good improvement. We sometimes do need those branches
to be expanded for fast lookups, especially close to the root level (where the cost of a linear
scan is huge when you consider millions of lookups). In fact, even the limit on the root arc
cache is something we had to bypass manually (root level fanout is huge if you have varied
int keys) and we expanded all of those arcs. 

Perhaps we have a specific use case, I don't know. I'll experiment a bit and see, just wanted
to mark the fact that the current storage of expanded nodes is fairly inefficient sometimes.

> FSTs can be very space-inefficient on array-expanded nodes
> ----------------------------------------------------------
>
>                 Key: LUCENE-8084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8084
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: capture-4.png
>
>
> We have FSTs which operate on a larger alphabet (keys in int) space and emit character
sequence outputs. I noticed that certain nodes get expanded into fixed-size arrays to accelerate
lookups (binary search). This has a potential problem, however, when the outputs emit larger
blobs of data (say, one of the outputs is very long, all the others are small). Then the fixed-size
array is very much overallocated, as evident on the attached picture.
> I wonder if it'd be better to encode the array as fixed-size, but without the outputs
and use a local fixed-size pointer into a "value pool" somewhere next to the node's arcs.
Theoretically this "value pool" could even be shared by all of automaton's nodes (saved once
at the end or flushed periodically). 
> Just a thought.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message