lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Levenshtein FST's?
Date Sat, 28 May 2016 08:33:18 GMT
> Point taken, but I wonder if there's an algorithmic shortcut to determinize
> the union of Levenshtein DFAs...

Levenshtein DFA is an automaton like any other; when you merge two
such automata, they will very likely contain states that need to be
merged (and their transition split) in order to be deterministic
again.

You could try to ways.

1) generate the language (all input strings) accepted by all automata,
then create a deterministic DFA out of that. Strangely enough, with
small inputs and distances, it could be doable,

2) try to union those deterministic automata to some extend, without
making them completely minimal or determnistic. This would require a
different traversal routine (with multiple states in the automaton),
but could be much faster in practice. See papers on re2 regular
expression engine -- they contain interesting ideas and pointers.

3) apply for a PhD to get the union of Levenshtein DFAs minimized
right away in one (simple) step. :)

D.

>
> Here's some stats on unioning n Levenshtein automata (2-edit, with
> transpositions, of names) together (not attempting to determinize them):
> * n=1000 numStates: 337,553 numTransitions: 1,577,237 (Xmx1g OK)
> * n=5000 numStates: 1,620,944 numTransitions: 7,669,610 (Xmx1g fails, Xmx2g
> OK)
> * n=10000 numStates: 3,248,802 numTransitions: 15,377,814 (Xmx2g fails,
> Xmx3g OK)
>
> On Thu, May 26, 2016 at 12:13 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> But how many states does the not-yet-determinized union of 5000+
>> Levenshtein automata contain?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, May 26, 2016 at 12:08 PM, Luke Nezda <lnezda@gmail.com> wrote:
>>
>> > I should note, I know in I can
>> > call Operations.determinize(union, 10_000_000) but union of 5000+
>> > Levenshtein automata seems to require too many states to be tractable,
>> and
>> > that's on the low end of what I'd like to work with.
>> >
>> > On Thu, May 26, 2016 at 9:59 AM, Luke Nezda <lnezda@gmail.com> wrote:
>> >
>> > > I was surprised union of deterministic automata wasn't deterministic
>> too,
>> > > but it would appear so:
>> > >
>> > > public void testUnionOfDeterministicsIsDeterministic() {
>> > >   Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1);
>> > >   Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1);
>> > >   assertTrue(mud.isDeterministic());
>> > >   assertTrue(mad.isDeterministic());
>> > >
>> > >   Automaton union = Operations.union(mud, mad);
>> > >   assertTrue(union.isDeterministic()); // fails
>> > > }
>> > >
>> > >
>> > > Maybe I have a bad embedded assumption? Or maybe there's an interesting
>> > > opportunity to make union operation try harder to make a deterministic
>> > > result?
>> > >
>> > > On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
>> > > lucene@mikemccandless.com> wrote:
>> > >
>> > >> Hmm it's curious you found determinization to be so costly since the
>> > >> Levenshtein automata are already determinized.
>> > >>
>> > >> But, yeah, converting THAT to an FST is tricky...
>> > >>
>> > >> Mike McCandless
>> > >>
>> > >> http://blog.mikemccandless.com
>> > >>
>> > >> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com>
wrote:
>> > >>
>> > >> > Oof, sounds too tricky for me to justify pursuing right now. 
While
>> > >> > union'ing 10k Levenshtein automata was tractable, seems
>> determinizing
>> > >> the
>> > >> > result is not (NP-hard - oops :)), let alone working out a suitably
>> > >> useful
>> > >> > conversion to an FST.
>> > >> >
>> > >> > Thank you very much for input!
>> > >> >
>> > >> > Kind regards,
>> > >> > - Luke
>> > >> >
>> > >> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
>> > >> > lucene@mikemccandless.com> wrote:
>> > >> >
>> > >> > > In theory if you could construct an FST from your union'd
>> > Levenshtein
>> > >> > > automata in the right format for SynonymFilter then you could
run
>> it
>> > >> > using
>> > >> > > that FST, and it would give you the start/end tokens for
each
>> match,
>> > >> and
>> > >> > > you'd have the output for each match be the original (unedited)
>> > terms.
>> > >> > >
>> > >> > > But I think information was already lost when you did the
initial
>> > >> union,
>> > >> > > i.e. which original term to output, on which arc (s) in the
>> > automaton.
>> > >> > >
>> > >> > > Also you'd have to at least determinize (and maybe you want
to
>> > >> minimize)
>> > >> > > the union'd automata since FST cannot represent non-deterministic
>> > >> > machines.
>> > >> > >
>> > >> > > Possibly you could determinize, reconstruct the lost information,
>> by
>> > >> > > walking the automaton for each original word, intersecting
the
>> > >> > Levenshtein
>> > >> > > automaton for that word, and recording the first arcs you
hit that
>> > has
>> > >> > one
>> > >> > > unique original word as its output, and placing outputs on
those
>> > arcs,
>> > >> > and
>> > >> > > then doing a "rote" conversion to the syn filter's FST format.
>> This
>> > >> part
>> > >> > > sounds tricky :)
>> > >> > >
>> > >> > > Mike McCandless
>> > >> > >
>> > >> > > http://blog.mikemccandless.com
>> > >> > >
>> > >> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com>
>> > wrote:
>> > >> > >
>> > >> > > > On Tuesday, May 24, 2016, Michael McCandless <
>> > >> > lucene@mikemccandless.com
>> > >> > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
>> > wrote:
>> > >> > > >
>> > >> > > > > This sounds ambitious ;)
>> > >> > > >
>> > >> > > >
>> > >> > > > That's why I was hoping for some advice from y'all :)
>> > >> > > >
>> > >> > > > >
>> > >> > > > > Note that Lucene has Levenshtein automata (not
FSTs).
>> > >> > > >
>> > >> > > >
>> > >> > > > Right, but I thought it might not be too big a leap.
>> > >> > > >
>> > >> > > > >
>> > >> > > > > Can't you just run an AutomatonQuery on your index
once you
>> have
>> > >> > > unioned
>> > >> > > > > all Levenshtein automata into a single one?
>> > >> > > > >
>> > >> > > > > Or is the reason that you want to convert to an
FST so you can
>> > >> > > associate
>> > >> > > > > which unedited form(s) a given document matched?
>> > >> > > >
>> > >> > > >
>> > >> > > > Correct, I want the unedited form(s) as well as the
match
>> > character
>> > >> > > offsets
>> > >> > > > of each match in each document.
>> > >> > > >
>> > >> > > > >
>> > >> > > > > Mike McCandless
>> > >> > > > >
>> > >> > > > > http://blog.mikemccandless.com
>> > >> > > > >
>> > >> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com
>> >
>> > >> > wrote:
>> > >> > > > >
>> > >> > > > > > Hello, all -
>> > >> > > > > >
>> > >> > > > > > I'd like to use Lucene's automaton/FST code
to achieve fast
>> > >> fuzzy
>> > >> > > (OSA
>> > >> > > > > edit
>> > >> > > > > > distance up to 2) search for many (10k+) strings
(knowledge
>> > >> base:
>> > >> > kb)
>> > >> > > > in
>> > >> > > > > > many large strings (docs).
>> > >> > > > > >
>> > >> > > > > > Approach I was thinking of: create Levenshtein
FST with all
>> > >> paths
>> > >> > > > > > associated with unedited form for each kb
key, union all
>> into
>> > >> > single
>> > >> > > > fst,
>> > >> > > > > > search docs for matches in fst in style of
SynonymFilter.
>> > >> > > > > >
>> > >> > > > > > * I created 10k Levenshtein automata from
kb keys  and
>> unioned
>> > >> > them,
>> > >> > > so
>> > >> > > > > > that seems tractable (took 1 minute, ~250MB
ram)
>> > >> > > > > > * SynonymFilter code worked fine to associate
output and
>> > record
>> > >> > match
>> > >> > > > > token
>> > >> > > > > > length.
>> > >> > > > > > * Saw how FuzzySuggester created Levenshtein
automata from
>> > >> > > query/lookup
>> > >> > > > > key
>> > >> > > > > > and intersected that with kb-like fst.
>> > >> > > > > >
>> > >> > > > > > I don't see how to create Levenshtein FSTs
(vs automatons)
>> > >> > > associating
>> > >> > > > > > outputs with unedited form, and union'ing
them together.
>> > >> > > > > >
>> > >> > > > > > Is this a bad idea?  Maybe better idea?
>> > >> > > > > >
>> > >> > > > > > Thanks in advance,
>> > >> > > > > > - Luke
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message