lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Levenshtein FST's?
Date Wed, 25 May 2016 20:07:39 GMT
Hmm it's curious you found determinization to be so costly since the
Levenshtein automata are already determinized.

But, yeah, converting THAT to an FST is tricky...

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com> wrote:

> Oof, sounds too tricky for me to justify pursuing right now.  While
> union'ing 10k Levenshtein automata was tractable, seems determinizing the
> result is not (NP-hard - oops :)), let alone working out a suitably useful
> conversion to an FST.
>
> Thank you very much for input!
>
> Kind regards,
> - Luke
>
> On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
> > In theory if you could construct an FST from your union'd Levenshtein
> > automata in the right format for SynonymFilter then you could run it
> using
> > that FST, and it would give you the start/end tokens for each match, and
> > you'd have the output for each match be the original (unedited) terms.
> >
> > But I think information was already lost when you did the initial union,
> > i.e. which original term to output, on which arc (s) in the automaton.
> >
> > Also you'd have to at least determinize (and maybe you want to minimize)
> > the union'd automata since FST cannot represent non-deterministic
> machines.
> >
> > Possibly you could determinize, reconstruct the lost information, by
> > walking the automaton for each original word, intersecting the
> Levenshtein
> > automaton for that word, and recording the first arcs you hit that has
> one
> > unique original word as its output, and placing outputs on those arcs,
> and
> > then doing a "rote" conversion to the syn filter's FST format.  This part
> > sounds tricky :)
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com> wrote:
> >
> > > On Tuesday, May 24, 2016, Michael McCandless <
> lucene@mikemccandless.com
> > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>> wrote:
> > >
> > > > This sounds ambitious ;)
> > >
> > >
> > > That's why I was hoping for some advice from y'all :)
> > >
> > > >
> > > > Note that Lucene has Levenshtein automata (not FSTs).
> > >
> > >
> > > Right, but I thought it might not be too big a leap.
> > >
> > > >
> > > > Can't you just run an AutomatonQuery on your index once you have
> > unioned
> > > > all Levenshtein automata into a single one?
> > > >
> > > > Or is the reason that you want to convert to an FST so you can
> > associate
> > > > which unedited form(s) a given document matched?
> > >
> > >
> > > Correct, I want the unedited form(s) as well as the match character
> > offsets
> > > of each match in each document.
> > >
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com>
> wrote:
> > > >
> > > > > Hello, all -
> > > > >
> > > > > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy
> > (OSA
> > > > edit
> > > > > distance up to 2) search for many (10k+) strings (knowledge base:
> kb)
> > > in
> > > > > many large strings (docs).
> > > > >
> > > > > Approach I was thinking of: create Levenshtein FST with all paths
> > > > > associated with unedited form for each kb key, union all into
> single
> > > fst,
> > > > > search docs for matches in fst in style of SynonymFilter.
> > > > >
> > > > > * I created 10k Levenshtein automata from kb keys  and unioned
> them,
> > so
> > > > > that seems tractable (took 1 minute, ~250MB ram)
> > > > > * SynonymFilter code worked fine to associate output and record
> match
> > > > token
> > > > > length.
> > > > > * Saw how FuzzySuggester created Levenshtein automata from
> > query/lookup
> > > > key
> > > > > and intersected that with kb-like fst.
> > > > >
> > > > > I don't see how to create Levenshtein FSTs (vs automatons)
> > associating
> > > > > outputs with unedited form, and union'ing them together.
> > > > >
> > > > > Is this a bad idea?  Maybe better idea?
> > > > >
> > > > > Thanks in advance,
> > > > > - Luke
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message