lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Levenshtein FST's?
Date Thu, 26 May 2016 17:13:08 GMT
But how many states does the not-yet-determinized union of 5000+
Levenshtein automata contain?

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 26, 2016 at 12:08 PM, Luke Nezda <lnezda@gmail.com> wrote:

> I should note, I know in I can
> call Operations.determinize(union, 10_000_000) but union of 5000+
> Levenshtein automata seems to require too many states to be tractable, and
> that's on the low end of what I'd like to work with.
>
> On Thu, May 26, 2016 at 9:59 AM, Luke Nezda <lnezda@gmail.com> wrote:
>
> > I was surprised union of deterministic automata wasn't deterministic too,
> > but it would appear so:
> >
> > public void testUnionOfDeterministicsIsDeterministic() {
> >   Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1);
> >   Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1);
> >   assertTrue(mud.isDeterministic());
> >   assertTrue(mad.isDeterministic());
> >
> >   Automaton union = Operations.union(mud, mad);
> >   assertTrue(union.isDeterministic()); // fails
> > }
> >
> >
> > Maybe I have a bad embedded assumption? Or maybe there's an interesting
> > opportunity to make union operation try harder to make a deterministic
> > result?
> >
> > On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> Hmm it's curious you found determinization to be so costly since the
> >> Levenshtein automata are already determinized.
> >>
> >> But, yeah, converting THAT to an FST is tricky...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com> wrote:
> >>
> >> > Oof, sounds too tricky for me to justify pursuing right now.  While
> >> > union'ing 10k Levenshtein automata was tractable, seems determinizing
> >> the
> >> > result is not (NP-hard - oops :)), let alone working out a suitably
> >> useful
> >> > conversion to an FST.
> >> >
> >> > Thank you very much for input!
> >> >
> >> > Kind regards,
> >> > - Luke
> >> >
> >> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
> >> > lucene@mikemccandless.com> wrote:
> >> >
> >> > > In theory if you could construct an FST from your union'd
> Levenshtein
> >> > > automata in the right format for SynonymFilter then you could run
it
> >> > using
> >> > > that FST, and it would give you the start/end tokens for each match,
> >> and
> >> > > you'd have the output for each match be the original (unedited)
> terms.
> >> > >
> >> > > But I think information was already lost when you did the initial
> >> union,
> >> > > i.e. which original term to output, on which arc (s) in the
> automaton.
> >> > >
> >> > > Also you'd have to at least determinize (and maybe you want to
> >> minimize)
> >> > > the union'd automata since FST cannot represent non-deterministic
> >> > machines.
> >> > >
> >> > > Possibly you could determinize, reconstruct the lost information,
by
> >> > > walking the automaton for each original word, intersecting the
> >> > Levenshtein
> >> > > automaton for that word, and recording the first arcs you hit that
> has
> >> > one
> >> > > unique original word as its output, and placing outputs on those
> arcs,
> >> > and
> >> > > then doing a "rote" conversion to the syn filter's FST format.  This
> >> part
> >> > > sounds tricky :)
> >> > >
> >> > > Mike McCandless
> >> > >
> >> > > http://blog.mikemccandless.com
> >> > >
> >> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com>
> wrote:
> >> > >
> >> > > > On Tuesday, May 24, 2016, Michael McCandless <
> >> > lucene@mikemccandless.com
> >> > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
> wrote:
> >> > > >
> >> > > > > This sounds ambitious ;)
> >> > > >
> >> > > >
> >> > > > That's why I was hoping for some advice from y'all :)
> >> > > >
> >> > > > >
> >> > > > > Note that Lucene has Levenshtein automata (not FSTs).
> >> > > >
> >> > > >
> >> > > > Right, but I thought it might not be too big a leap.
> >> > > >
> >> > > > >
> >> > > > > Can't you just run an AutomatonQuery on your index once
you have
> >> > > unioned
> >> > > > > all Levenshtein automata into a single one?
> >> > > > >
> >> > > > > Or is the reason that you want to convert to an FST so you
can
> >> > > associate
> >> > > > > which unedited form(s) a given document matched?
> >> > > >
> >> > > >
> >> > > > Correct, I want the unedited form(s) as well as the match
> character
> >> > > offsets
> >> > > > of each match in each document.
> >> > > >
> >> > > > >
> >> > > > > Mike McCandless
> >> > > > >
> >> > > > > http://blog.mikemccandless.com
> >> > > > >
> >> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com>
> >> > wrote:
> >> > > > >
> >> > > > > > Hello, all -
> >> > > > > >
> >> > > > > > I'd like to use Lucene's automaton/FST code to achieve
fast
> >> fuzzy
> >> > > (OSA
> >> > > > > edit
> >> > > > > > distance up to 2) search for many (10k+) strings (knowledge
> >> base:
> >> > kb)
> >> > > > in
> >> > > > > > many large strings (docs).
> >> > > > > >
> >> > > > > > Approach I was thinking of: create Levenshtein FST
with all
> >> paths
> >> > > > > > associated with unedited form for each kb key, union
all into
> >> > single
> >> > > > fst,
> >> > > > > > search docs for matches in fst in style of SynonymFilter.
> >> > > > > >
> >> > > > > > * I created 10k Levenshtein automata from kb keys 
and unioned
> >> > them,
> >> > > so
> >> > > > > > that seems tractable (took 1 minute, ~250MB ram)
> >> > > > > > * SynonymFilter code worked fine to associate output
and
> record
> >> > match
> >> > > > > token
> >> > > > > > length.
> >> > > > > > * Saw how FuzzySuggester created Levenshtein automata
from
> >> > > query/lookup
> >> > > > > key
> >> > > > > > and intersected that with kb-like fst.
> >> > > > > >
> >> > > > > > I don't see how to create Levenshtein FSTs (vs automatons)
> >> > > associating
> >> > > > > > outputs with unedited form, and union'ing them together.
> >> > > > > >
> >> > > > > > Is this a bad idea?  Maybe better idea?
> >> > > > > >
> >> > > > > > Thanks in advance,
> >> > > > > > - Luke
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message