lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Levenshtein FST's?
Date Thu, 26 May 2016 17:12:02 GMT
Union of determinized automata won't be deterministic in general.  E.g.
imagine the union of "ab" and "ac" .. in this case, the initial node will
have two separate arcs with label "a" leaving.

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 26, 2016 at 10:59 AM, Luke Nezda <lnezda@gmail.com> wrote:

> I was surprised union of deterministic automata wasn't deterministic too,
> but it would appear so:
>
> public void testUnionOfDeterministicsIsDeterministic() {
>   Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1);
>   Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1);
>   assertTrue(mud.isDeterministic());
>   assertTrue(mad.isDeterministic());
>
>   Automaton union = Operations.union(mud, mad);
>   assertTrue(union.isDeterministic()); // fails
> }
>
>
> Maybe I have a bad embedded assumption? Or maybe there's an interesting
> opportunity to make union operation try harder to make a deterministic
> result?
>
> On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
> > Hmm it's curious you found determinization to be so costly since the
> > Levenshtein automata are already determinized.
> >
> > But, yeah, converting THAT to an FST is tricky...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com> wrote:
> >
> > > Oof, sounds too tricky for me to justify pursuing right now.  While
> > > union'ing 10k Levenshtein automata was tractable, seems determinizing
> the
> > > result is not (NP-hard - oops :)), let alone working out a suitably
> > useful
> > > conversion to an FST.
> > >
> > > Thank you very much for input!
> > >
> > > Kind regards,
> > > - Luke
> > >
> > > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
> > > lucene@mikemccandless.com> wrote:
> > >
> > > > In theory if you could construct an FST from your union'd Levenshtein
> > > > automata in the right format for SynonymFilter then you could run it
> > > using
> > > > that FST, and it would give you the start/end tokens for each match,
> > and
> > > > you'd have the output for each match be the original (unedited)
> terms.
> > > >
> > > > But I think information was already lost when you did the initial
> > union,
> > > > i.e. which original term to output, on which arc (s) in the
> automaton.
> > > >
> > > > Also you'd have to at least determinize (and maybe you want to
> > minimize)
> > > > the union'd automata since FST cannot represent non-deterministic
> > > machines.
> > > >
> > > > Possibly you could determinize, reconstruct the lost information, by
> > > > walking the automaton for each original word, intersecting the
> > > Levenshtein
> > > > automaton for that word, and recording the first arcs you hit that
> has
> > > one
> > > > unique original word as its output, and placing outputs on those
> arcs,
> > > and
> > > > then doing a "rote" conversion to the syn filter's FST format.  This
> > part
> > > > sounds tricky :)
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com>
> wrote:
> > > >
> > > > > On Tuesday, May 24, 2016, Michael McCandless <
> > > lucene@mikemccandless.com
> > > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
> wrote:
> > > > >
> > > > > > This sounds ambitious ;)
> > > > >
> > > > >
> > > > > That's why I was hoping for some advice from y'all :)
> > > > >
> > > > > >
> > > > > > Note that Lucene has Levenshtein automata (not FSTs).
> > > > >
> > > > >
> > > > > Right, but I thought it might not be too big a leap.
> > > > >
> > > > > >
> > > > > > Can't you just run an AutomatonQuery on your index once you
have
> > > > unioned
> > > > > > all Levenshtein automata into a single one?
> > > > > >
> > > > > > Or is the reason that you want to convert to an FST so you can
> > > > associate
> > > > > > which unedited form(s) a given document matched?
> > > > >
> > > > >
> > > > > Correct, I want the unedited form(s) as well as the match character
> > > > offsets
> > > > > of each match in each document.
> > > > >
> > > > > >
> > > > > > Mike McCandless
> > > > > >
> > > > > > http://blog.mikemccandless.com
> > > > > >
> > > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hello, all -
> > > > > > >
> > > > > > > I'd like to use Lucene's automaton/FST code to achieve
fast
> fuzzy
> > > > (OSA
> > > > > > edit
> > > > > > > distance up to 2) search for many (10k+) strings (knowledge
> base:
> > > kb)
> > > > > in
> > > > > > > many large strings (docs).
> > > > > > >
> > > > > > > Approach I was thinking of: create Levenshtein FST with
all
> paths
> > > > > > > associated with unedited form for each kb key, union all
into
> > > single
> > > > > fst,
> > > > > > > search docs for matches in fst in style of SynonymFilter.
> > > > > > >
> > > > > > > * I created 10k Levenshtein automata from kb keys  and
unioned
> > > them,
> > > > so
> > > > > > > that seems tractable (took 1 minute, ~250MB ram)
> > > > > > > * SynonymFilter code worked fine to associate output and
record
> > > match
> > > > > > token
> > > > > > > length.
> > > > > > > * Saw how FuzzySuggester created Levenshtein automata from
> > > > query/lookup
> > > > > > key
> > > > > > > and intersected that with kb-like fst.
> > > > > > >
> > > > > > > I don't see how to create Levenshtein FSTs (vs automatons)
> > > > associating
> > > > > > > outputs with unedited form, and union'ing them together.
> > > > > > >
> > > > > > > Is this a bad idea?  Maybe better idea?
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > - Luke
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message