lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8084) FSTs can be very space-inefficient on array-expanded nodes
Date Wed, 13 Dec 2017 00:55:00 GMT


Michael McCandless commented on LUCENE-8084:

I like that idea!

Maybe in the meantime we should improve the logic that decides if an node should use the dense
encoding to take into account waste due to large outputs?

> FSTs can be very space-inefficient on array-expanded nodes
> ----------------------------------------------------------
>                 Key: LUCENE-8084
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: capture-4.png
> We have FSTs which operate on a larger alphabet (keys in int) space and emit character
sequence outputs. I noticed that certain nodes get expanded into fixed-size arrays to accelerate
lookups (binary search). This has a potential problem, however, when the outputs emit larger
blobs of data (say, one of the outputs is very long, all the others are small). Then the fixed-size
array is very much overallocated, as evident on the attached picture.
> I wonder if it'd be better to encode the array as fixed-size, but without the outputs
and use a local fixed-size pointer into a "value pool" somewhere next to the node's arcs.
Theoretically this "value pool" could even be shared by all of automaton's nodes (saved once
at the end or flushed periodically). 
> Just a thought.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message