lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Re: Levenshtein FST's?
Date Thu, 26 May 2016 14:59:51 GMT
I was surprised union of deterministic automata wasn't deterministic too,
but it would appear so:

public void testUnionOfDeterministicsIsDeterministic() {
  Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1);
  Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1);
  assertTrue(mud.isDeterministic());
  assertTrue(mad.isDeterministic());

  Automaton union = Operations.union(mud, mad);
  assertTrue(union.isDeterministic()); // fails
}


Maybe I have a bad embedded assumption? Or maybe there's an interesting
opportunity to make union operation try harder to make a deterministic
result?

On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Hmm it's curious you found determinization to be so costly since the
> Levenshtein automata are already determinized.
>
> But, yeah, converting THAT to an FST is tricky...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com> wrote:
>
> > Oof, sounds too tricky for me to justify pursuing right now.  While
> > union'ing 10k Levenshtein automata was tractable, seems determinizing the
> > result is not (NP-hard - oops :)), let alone working out a suitably
> useful
> > conversion to an FST.
> >
> > Thank you very much for input!
> >
> > Kind regards,
> > - Luke
> >
> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> > > In theory if you could construct an FST from your union'd Levenshtein
> > > automata in the right format for SynonymFilter then you could run it
> > using
> > > that FST, and it would give you the start/end tokens for each match,
> and
> > > you'd have the output for each match be the original (unedited) terms.
> > >
> > > But I think information was already lost when you did the initial
> union,
> > > i.e. which original term to output, on which arc (s) in the automaton.
> > >
> > > Also you'd have to at least determinize (and maybe you want to
> minimize)
> > > the union'd automata since FST cannot represent non-deterministic
> > machines.
> > >
> > > Possibly you could determinize, reconstruct the lost information, by
> > > walking the automaton for each original word, intersecting the
> > Levenshtein
> > > automaton for that word, and recording the first arcs you hit that has
> > one
> > > unique original word as its output, and placing outputs on those arcs,
> > and
> > > then doing a "rote" conversion to the syn filter's FST format.  This
> part
> > > sounds tricky :)
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com> wrote:
> > >
> > > > On Tuesday, May 24, 2016, Michael McCandless <
> > lucene@mikemccandless.com
> > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
wrote:
> > > >
> > > > > This sounds ambitious ;)
> > > >
> > > >
> > > > That's why I was hoping for some advice from y'all :)
> > > >
> > > > >
> > > > > Note that Lucene has Levenshtein automata (not FSTs).
> > > >
> > > >
> > > > Right, but I thought it might not be too big a leap.
> > > >
> > > > >
> > > > > Can't you just run an AutomatonQuery on your index once you have
> > > unioned
> > > > > all Levenshtein automata into a single one?
> > > > >
> > > > > Or is the reason that you want to convert to an FST so you can
> > > associate
> > > > > which unedited form(s) a given document matched?
> > > >
> > > >
> > > > Correct, I want the unedited form(s) as well as the match character
> > > offsets
> > > > of each match in each document.
> > > >
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
> > > > >
> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com>
> > wrote:
> > > > >
> > > > > > Hello, all -
> > > > > >
> > > > > > I'd like to use Lucene's automaton/FST code to achieve fast
fuzzy
> > > (OSA
> > > > > edit
> > > > > > distance up to 2) search for many (10k+) strings (knowledge
base:
> > kb)
> > > > in
> > > > > > many large strings (docs).
> > > > > >
> > > > > > Approach I was thinking of: create Levenshtein FST with all
paths
> > > > > > associated with unedited form for each kb key, union all into
> > single
> > > > fst,
> > > > > > search docs for matches in fst in style of SynonymFilter.
> > > > > >
> > > > > > * I created 10k Levenshtein automata from kb keys  and unioned
> > them,
> > > so
> > > > > > that seems tractable (took 1 minute, ~250MB ram)
> > > > > > * SynonymFilter code worked fine to associate output and record
> > match
> > > > > token
> > > > > > length.
> > > > > > * Saw how FuzzySuggester created Levenshtein automata from
> > > query/lookup
> > > > > key
> > > > > > and intersected that with kb-like fst.
> > > > > >
> > > > > > I don't see how to create Levenshtein FSTs (vs automatons)
> > > associating
> > > > > > outputs with unedited form, and union'ing them together.
> > > > > >
> > > > > > Is this a bad idea?  Maybe better idea?
> > > > > >
> > > > > > Thanks in advance,
> > > > > > - Luke
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message