lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Levenshtein FST's?
Date Tue, 24 May 2016 23:19:09 GMT
In theory, if you could construct an FST from your union'd Levenshtein
automata in the right format for SynonymFilter, then you could run
SynonymFilter over that FST: it would give you the start/end tokens for each
match, and the output for each match would be the original (unedited) terms.
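
For concreteness, here's a rough sketch of the plumbing SynonymFilter
normally runs over, i.e. a SynonymMap whose FST maps input token sequences to
output terms.  This only illustrates the target format (the field name and
strings below are made up); it doesn't do the Levenshtein piece at all:

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.CharsRef;

public class SynonymMapSketch {
  public static void main(String[] args) throws IOException {
    SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
    // Each add() maps an input surface form to the unedited output term:
    builder.add(new CharsRef("mispeling"), new CharsRef("misspelling"), false);
    SynonymMap map = builder.build(); // builds the FST SynonymFilter runs over

    StandardAnalyzer analyzer = new StandardAnalyzer();
    TokenStream ts = new SynonymFilter(
        analyzer.tokenStream("body", "a mispeling in the doc"), map, true);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Injected output terms carry the start/end offsets of the matched input.
      System.out.println(term + " [" + offsets.startOffset() + "-" + offsets.endOffset() + "]");
    }
    ts.end();
    ts.close();
    analyzer.close();
  }
}

Injected synonym tokens reuse the matched tokens' offsets, which is how you'd
get the character offsets you want back out.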

But I think information was already lost when you did the initial union,
i.e. which original term to output on which arc(s) in the automaton.

Also, you'd have to at least determinize (and maybe you want to minimize)
the union'd automaton, since an FST cannot represent a non-deterministic
machine.
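
Just as a sketch of that step (kbKeys and the max-states budget are
placeholders, and I haven't tried this at 10k-key scale), using the
org.apache.lucene.util.automaton utilities:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.apache.lucene.util.automaton.MinimizationOperations;
import org.apache.lucene.util.automaton.Operations;

static Automaton unionedLevenshteinAutomaton(Iterable<String> kbKeys) {
  List<Automaton> perKey = new ArrayList<>();
  for (String key : kbKeys) {
    // edit distance 2, with transpositions (closest to the OSA-style distance)
    perKey.add(new LevenshteinAutomata(key, true).toAutomaton(2));
  }
  Automaton union = Operations.union(perKey);  // non-deterministic in general
  // The default max-determinized-states budget (10000) is almost certainly
  // too small for ~10k unioned automata, so pass an explicit, generous limit.
  int maxStates = Integer.MAX_VALUE;
  Automaton det = Operations.determinize(union, maxStates);
  return MinimizationOperations.minimize(det, maxStates);
}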

Possibly you could determinize, then reconstruct the lost information by
walking the determinized automaton for each original word, intersecting with
the Levenshtein automaton for that word, recording the first arcs you hit
that have one unique original word as their output, placing outputs on those
arcs, and then doing a "rote" conversion to the syn filter's FST format.
This part sounds tricky :)
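
Very roughly, it might start out like this; the genuinely tricky parts
(placing the outputs on arcs and the conversion to the syn filter's FST) are
left as TODO comments, because that's exactly the hard bit:

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.apache.lucene.util.automaton.Operations;

static void reconstructOutputs(Iterable<String> kbKeys, Automaton unionDet) {
  for (String key : kbKeys) {
    Automaton lev = new LevenshteinAutomata(key, true).toAutomaton(2);
    // The paths of the big determinized union that this kb key contributed:
    Automaton forThisKey = Operations.intersection(unionDet, lev);
    // TODO (the hard part): walk forThisKey's transitions with
    // Automaton.initTransition / getNextTransition, find the earliest arcs
    // reachable only through this key, attach "key" as the output there, and
    // then do the rote conversion into the FST<BytesRef> layout that
    // SynonymMap/SynonymFilter expect.
  }
}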

Mike McCandless

http://blog.mikemccandless.com

On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com> wrote:

> On Tuesday, May 24, 2016, Michael McCandless <lucene@mikemccandless.com> wrote:
>
> > This sounds ambitious ;)
>
>
> That's why I was hoping for some advice from y'all :)
>
> >
> > Note that Lucene has Levenshtein automata (not FSTs).
>
>
> Right, but I thought it might not be too big a leap.
>
> >
> > Can't you just run an AutomatonQuery on your index once you have unioned
> > all Levenshtein automata into a single one?
> >
> > Or is the reason that you want to convert to an FST so you can associate
> > which unedited form(s) a given document matched?
>
>
> Correct, I want the unedited form(s) as well as the character offsets of
> each match in each document.
>
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com> wrote:
> >
> > > Hello, all -
> > >
> > > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy (OSA
> > > edit distance up to 2) search for many (10k+) strings (knowledge base:
> > > kb) in many large strings (docs).
> > >
> > > The approach I was thinking of: create a Levenshtein FST for each kb key
> > > with all paths associated with the unedited form, union them all into a
> > > single FST, and search docs for matches in that FST in the style of
> > > SynonymFilter.
> > >
> > > * I created 10k Levenshtein automata from kb keys and unioned them, so
> > > that seems tractable (took 1 minute, ~250MB RAM).
> > > * The SynonymFilter code worked fine to associate outputs and record
> > > match token length.
> > > * I saw how FuzzySuggester creates a Levenshtein automaton from the
> > > query/lookup key and intersects it with a kb-like FST.
> > >
> > > What I don't see is how to create Levenshtein FSTs (vs. automata) that
> > > associate outputs with the unedited form, and how to union them together.
> > >
> > > Is this a bad idea?  Maybe better idea?
> > >
> > > Thanks in advance,
> > > - Luke
> > >
> >
>
