lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Re: Levenshtein FST's?
Date Sat, 28 May 2016 15:23:54 GMT
On Sat, May 28, 2016 at 3:33 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:

> > Point taken, but I wonder if there's an algorithmic shortcut to
> determinize
> > the union of Levenshtein DFAs...
>
> Levenshtein DFA is an automaton like any other; when you merge two
> such automata, they will very likely contain states that need to be
> merged (and their transition split) in order to be deterministic
> again.
>

I think I see this now, and how skipping determinization and matching with
the NFA could easily leave you with an intractable amount of backtracking
for even the simpler binary question of does my input match any of the
automatons I've unioned.

>
> You could try to ways.
>
> 1) generate the language (all input strings) accepted by all automata,
> then create a deterministic DFA out of that. Strangely enough, with
> small inputs and distances, it could be doable,
>

Ok, yea I could see these being the same ones whose union is not too large
to determinize.  Was combinatorially humbling to see how many strings are
accepted by Levenshtein automata using FiniteStringsIterator.

>
> 2) try to union those deterministic automata to some extend, without
> making them completely minimal or determnistic. This would require a
> different traversal routine (with multiple states in the automaton),
> but could be much faster in practice. See papers on re2 regular
> expression engine -- they contain interesting ideas and pointers.
>

re2 does look neat for matching regexes which can be expressed as DFAs.

I think I see now how determinizing the union of (Levenshtein) DFAs likely
has no "algorithmic shortcut" I'll be able to devise :)

>
> 3) apply for a PhD to get the union of Levenshtein DFAs minimized
> right away in one (simple) step. :)
>

I've enjoyed the conversation - thanks for helping me debunk my naive
concept.

>
> D.
>
> >
> > Here's some stats on unioning n Levenshtein automata (2-edit, with
> > transpositions, of names) together (not attempting to determinize them):
> > * n=1000 numStates: 337,553 numTransitions: 1,577,237 (Xmx1g OK)
> > * n=5000 numStates: 1,620,944 numTransitions: 7,669,610 (Xmx1g fails,
> Xmx2g
> > OK)
> > * n=10000 numStates: 3,248,802 numTransitions: 15,377,814 (Xmx2g fails,
> > Xmx3g OK)
> >
> > On Thu, May 26, 2016 at 12:13 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> But how many states does the not-yet-determinized union of 5000+
> >> Levenshtein automata contain?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Thu, May 26, 2016 at 12:08 PM, Luke Nezda <lnezda@gmail.com> wrote:
> >>
> >> > I should note, I know in I can
> >> > call Operations.determinize(union, 10_000_000) but union of 5000+
> >> > Levenshtein automata seems to require too many states to be tractable,
> >> and
> >> > that's on the low end of what I'd like to work with.
> >> >
> >> > On Thu, May 26, 2016 at 9:59 AM, Luke Nezda <lnezda@gmail.com> wrote:
> >> >
> >> > > I was surprised union of deterministic automata wasn't deterministic
> >> too,
> >> > > but it would appear so:
> >> > >
> >> > > public void testUnionOfDeterministicsIsDeterministic() {
> >> > >   Automaton mud = new LevenshteinAutomata("mud",
> true).toAutomaton(1);
> >> > >   Automaton mad = new LevenshteinAutomata("mad",
> true).toAutomaton(1);
> >> > >   assertTrue(mud.isDeterministic());
> >> > >   assertTrue(mad.isDeterministic());
> >> > >
> >> > >   Automaton union = Operations.union(mud, mad);
> >> > >   assertTrue(union.isDeterministic()); // fails
> >> > > }
> >> > >
> >> > >
> >> > > Maybe I have a bad embedded assumption? Or maybe there's an
> interesting
> >> > > opportunity to make union operation try harder to make a
> deterministic
> >> > > result?
> >> > >
> >> > > On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
> >> > > lucene@mikemccandless.com> wrote:
> >> > >
> >> > >> Hmm it's curious you found determinization to be so costly since
> the
> >> > >> Levenshtein automata are already determinized.
> >> > >>
> >> > >> But, yeah, converting THAT to an FST is tricky...
> >> > >>
> >> > >> Mike McCandless
> >> > >>
> >> > >> http://blog.mikemccandless.com
> >> > >>
> >> > >> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com>
> wrote:
> >> > >>
> >> > >> > Oof, sounds too tricky for me to justify pursuing right now.
> While
> >> > >> > union'ing 10k Levenshtein automata was tractable, seems
> >> determinizing
> >> > >> the
> >> > >> > result is not (NP-hard - oops :)), let alone working out
a
> suitably
> >> > >> useful
> >> > >> > conversion to an FST.
> >> > >> >
> >> > >> > Thank you very much for input!
> >> > >> >
> >> > >> > Kind regards,
> >> > >> > - Luke
> >> > >> >
> >> > >> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
> >> > >> > lucene@mikemccandless.com> wrote:
> >> > >> >
> >> > >> > > In theory if you could construct an FST from your union'd
> >> > Levenshtein
> >> > >> > > automata in the right format for SynonymFilter then
you could
> run
> >> it
> >> > >> > using
> >> > >> > > that FST, and it would give you the start/end tokens
for each
> >> match,
> >> > >> and
> >> > >> > > you'd have the output for each match be the original
(unedited)
> >> > terms.
> >> > >> > >
> >> > >> > > But I think information was already lost when you did
the
> initial
> >> > >> union,
> >> > >> > > i.e. which original term to output, on which arc (s)
in the
> >> > automaton.
> >> > >> > >
> >> > >> > > Also you'd have to at least determinize (and maybe you
want to
> >> > >> minimize)
> >> > >> > > the union'd automata since FST cannot represent
> non-deterministic
> >> > >> > machines.
> >> > >> > >
> >> > >> > > Possibly you could determinize, reconstruct the lost
> information,
> >> by
> >> > >> > > walking the automaton for each original word, intersecting
the
> >> > >> > Levenshtein
> >> > >> > > automaton for that word, and recording the first arcs
you hit
> that
> >> > has
> >> > >> > one
> >> > >> > > unique original word as its output, and placing outputs
on
> those
> >> > arcs,
> >> > >> > and
> >> > >> > > then doing a "rote" conversion to the syn filter's FST
format.
> >> This
> >> > >> part
> >> > >> > > sounds tricky :)
> >> > >> > >
> >> > >> > > Mike McCandless
> >> > >> > >
> >> > >> > > http://blog.mikemccandless.com
> >> > >> > >
> >> > >> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com>
> >> > wrote:
> >> > >> > >
> >> > >> > > > On Tuesday, May 24, 2016, Michael McCandless <
> >> > >> > lucene@mikemccandless.com
> >> > >> > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
> >> > wrote:
> >> > >> > > >
> >> > >> > > > > This sounds ambitious ;)
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > That's why I was hoping for some advice from y'all
:)
> >> > >> > > >
> >> > >> > > > >
> >> > >> > > > > Note that Lucene has Levenshtein automata
(not FSTs).
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > Right, but I thought it might not be too big a
leap.
> >> > >> > > >
> >> > >> > > > >
> >> > >> > > > > Can't you just run an AutomatonQuery on your
index once you
> >> have
> >> > >> > > unioned
> >> > >> > > > > all Levenshtein automata into a single one?
> >> > >> > > > >
> >> > >> > > > > Or is the reason that you want to convert
to an FST so you
> can
> >> > >> > > associate
> >> > >> > > > > which unedited form(s) a given document matched?
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > Correct, I want the unedited form(s) as well as
the match
> >> > character
> >> > >> > > offsets
> >> > >> > > > of each match in each document.
> >> > >> > > >
> >> > >> > > > >
> >> > >> > > > > Mike McCandless
> >> > >> > > > >
> >> > >> > > > > http://blog.mikemccandless.com
> >> > >> > > > >
> >> > >> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda
<
> lnezda@gmail.com
> >> >
> >> > >> > wrote:
> >> > >> > > > >
> >> > >> > > > > > Hello, all -
> >> > >> > > > > >
> >> > >> > > > > > I'd like to use Lucene's automaton/FST
code to achieve
> fast
> >> > >> fuzzy
> >> > >> > > (OSA
> >> > >> > > > > edit
> >> > >> > > > > > distance up to 2) search for many (10k+)
strings
> (knowledge
> >> > >> base:
> >> > >> > kb)
> >> > >> > > > in
> >> > >> > > > > > many large strings (docs).
> >> > >> > > > > >
> >> > >> > > > > > Approach I was thinking of: create Levenshtein
FST with
> all
> >> > >> paths
> >> > >> > > > > > associated with unedited form for each
kb key, union all
> >> into
> >> > >> > single
> >> > >> > > > fst,
> >> > >> > > > > > search docs for matches in fst in style
of SynonymFilter.
> >> > >> > > > > >
> >> > >> > > > > > * I created 10k Levenshtein automata
from kb keys  and
> >> unioned
> >> > >> > them,
> >> > >> > > so
> >> > >> > > > > > that seems tractable (took 1 minute,
~250MB ram)
> >> > >> > > > > > * SynonymFilter code worked fine to associate
output and
> >> > record
> >> > >> > match
> >> > >> > > > > token
> >> > >> > > > > > length.
> >> > >> > > > > > * Saw how FuzzySuggester created Levenshtein
automata
> from
> >> > >> > > query/lookup
> >> > >> > > > > key
> >> > >> > > > > > and intersected that with kb-like fst.
> >> > >> > > > > >
> >> > >> > > > > > I don't see how to create Levenshtein
FSTs (vs
> automatons)
> >> > >> > > associating
> >> > >> > > > > > outputs with unedited form, and union'ing
them together.
> >> > >> > > > > >
> >> > >> > > > > > Is this a bad idea?  Maybe better idea?
> >> > >> > > > > >
> >> > >> > > > > > Thanks in advance,
> >> > >> > > > > > - Luke
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message