lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <>
Subject [jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene
Date Sat, 11 Dec 2010 11:16:03 GMT


Uwe Schindler commented on LUCENE-2792:

Hi Mike,
my problem is that I have no time to closely look into it and I don't understand the whole
thing :-) (you cannot always understand everything *g*). So I have no idea how the generics
*should* look like. I don't even understand all the SuppressWarnings currently in the code,
because in my opinion, generics warnings cannot occur at all those places. From a first look
it seems that a few places using untyped Output, but for that the methods need to be generified
(a generic <T> before the return type like in AttributeSource). The "copy" in one of
the TODOs cannot be avoided by generics, because the generics are type erasure, so it would
not help in that method (only that you may not need to add so many casts, which on the other
hand the compiler will add).

> Add a simple FST impl to Lucene
> -------------------------------
>                 Key: LUCENE-2792
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: FSTExample.png, LUCENE-2792.patch, LUCENE-2792.patch
> I implemented the algo described at
> for
> incrementally building a finite state transducer (FST) from sorted
> inputs.
> This is not a fully general FST impl -- it's only able to build up an
> FST incrementally from input/output pairs that are pre-sorted.
> Currently the inputs are BytesRefs, and the outputs are pluggable --
> NoOutputs gets you a simple FSA, PositiveIntOutputs maps to a long,
> ByteSequenceOutput maps to a BytesRef.
> The implementation has a low memory overhead, so that it can handle a
> fairly large set of terms.  For example, it can build the FSA for the
> 9.8M terms from a 10M document wikipedia index in ~8 seconds (on
> beast), using ~256 MB peak RAM, resulting in an FSA that's ~60 MB.
> It packs the FST as-it-builds into a compact byte[], and then exposes
> the API to read nodes/arcs directly from the byte[].  The FST can be
> quickly saved/loaded to/from a Directory since it's just a big byte[].
> The format is similar to what Morfologik uses
> (
> I think there are a number of possible places we can use this in
> Lucene.  For example, I think many apps could hold the entire terms
> dict in RAM, either at the multi-reader level or maybe per-segment
> (mapping to file offset or to something else custom to the app), which
> may possibly be a good speedup for certain MTQs (though, because the
> format is packed into a byte[], there is a decode cost when visiting
> arcs).
> The builder can also prune as it goes, so you get a prefix trie pruned
> according to how many terms run through the nodes, which makes it
> faster and even less memory consuming.  This may be useful as a
> replacement for our current binary search terms index since it can
> achieve higher term density for the same RAM consumption of our
> current index.
> As an initial usage to make sure this is exercised, I cutover the
> SimpleText codec, which currently fully loads all terms into a
> TreeMap (and has caused intermittent OOME in some tests), to use an FST
> instead.  SimpleText uses a PairOutputs which is able to "pair up" any
> two other outputs, since it needs to map each input term to an int
> docFreq and long filePosition.
> All tests pass w/ SimpleText forced codec, and I think this is
> committable except I'd love to get some help w/ the generics
> (confession to the policeman: I had to add
> @SuppressWarnings({"unchecked"})) all over!!  Ideally an FST is
> parameterized by its output type (Integer, BytesRef, etc.).
> I even added a new @nightly test that makes a largeish set of random
> terms and tests the resulting FST on different outputs :)
> I think it would also be easy to make a variant that uses char[]
> instead of byte[] as its inputs, so we could eg use this during analysis
> (Robert's idea).  It's already be easy to have a CharSequence
> output type since the outputs are pluggable.
> Dawid Weiss (author of HPPC -- -- and
> Morfologik --
> was very helpful iterating with me on this (thank you!).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message