lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Re: Levenshtein FST's?
Date Wed, 25 May 2016 18:46:38 GMT
Oof, sounds too tricky for me to justify pursuing right now.  While
union'ing 10k Levenshtein automata was tractable, seems determinizing the
result is not (NP-hard - oops :)), let alone working out a suitably useful
conversion to an FST.

Thank you very much for input!

Kind regards,
- Luke

On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> In theory if you could construct an FST from your union'd Levenshtein
> automata in the right format for SynonymFilter then you could run it using
> that FST, and it would give you the start/end tokens for each match, and
> you'd have the output for each match be the original (unedited) terms.
>
> But I think information was already lost when you did the initial union,
> i.e. which original term to output, on which arc (s) in the automaton.
>
> Also you'd have to at least determinize (and maybe you want to minimize)
> the union'd automata since FST cannot represent non-deterministic machines.
>
> Possibly you could determinize, reconstruct the lost information, by
> walking the automaton for each original word, intersecting the Levenshtein
> automaton for that word, and recording the first arcs you hit that has one
> unique original word as its output, and placing outputs on those arcs, and
> then doing a "rote" conversion to the syn filter's FST format.  This part
> sounds tricky :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com> wrote:
>
> > On Tuesday, May 24, 2016, Michael McCandless <lucene@mikemccandless.com
> > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>> wrote:
> >
> > > This sounds ambitious ;)
> >
> >
> > That's why I was hoping for some advice from y'all :)
> >
> > >
> > > Note that Lucene has Levenshtein automata (not FSTs).
> >
> >
> > Right, but I thought it might not be too big a leap.
> >
> > >
> > > Can't you just run an AutomatonQuery on your index once you have
> unioned
> > > all Levenshtein automata into a single one?
> > >
> > > Or is the reason that you want to convert to an FST so you can
> associate
> > > which unedited form(s) a given document matched?
> >
> >
> > Correct, I want the unedited form(s) as well as the match character
> offsets
> > of each match in each document.
> >
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com> wrote:
> > >
> > > > Hello, all -
> > > >
> > > > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy
> (OSA
> > > edit
> > > > distance up to 2) search for many (10k+) strings (knowledge base: kb)
> > in
> > > > many large strings (docs).
> > > >
> > > > Approach I was thinking of: create Levenshtein FST with all paths
> > > > associated with unedited form for each kb key, union all into single
> > fst,
> > > > search docs for matches in fst in style of SynonymFilter.
> > > >
> > > > * I created 10k Levenshtein automata from kb keys  and unioned them,
> so
> > > > that seems tractable (took 1 minute, ~250MB ram)
> > > > * SynonymFilter code worked fine to associate output and record match
> > > token
> > > > length.
> > > > * Saw how FuzzySuggester created Levenshtein automata from
> query/lookup
> > > key
> > > > and intersected that with kb-like fst.
> > > >
> > > > I don't see how to create Levenshtein FSTs (vs automatons)
> associating
> > > > outputs with unedited form, and union'ing them together.
> > > >
> > > > Is this a bad idea?  Maybe better idea?
> > > >
> > > > Thanks in advance,
> > > > - Luke
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message