lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Levenshtein FST's?
Date Tue, 24 May 2016 13:07:57 GMT
On Tuesday, May 24, 2016, Michael McCandless <lucene@mikemccandless.com
<javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>> wrote:

> This sounds ambitious ;)


That's why I was hoping for some advice from y'all :)

>
> Note that Lucene has Levenshtein automata (not FSTs).


Right, but I thought it might not be too big a leap.

>
> Can't you just run an AutomatonQuery on your index once you have unioned
> all Levenshtein automata into a single one?
>
> Or is the reason that you want to convert to an FST so you can associate
> which unedited form(s) a given document matched?


Correct, I want the unedited form(s) as well as the match character offsets
of each match in each document.

>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com> wrote:
>
> > Hello, all -
> >
> > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy (OSA
> edit
> > distance up to 2) search for many (10k+) strings (knowledge base: kb) in
> > many large strings (docs).
> >
> > Approach I was thinking of: create Levenshtein FST with all paths
> > associated with unedited form for each kb key, union all into single fst,
> > search docs for matches in fst in style of SynonymFilter.
> >
> > * I created 10k Levenshtein automata from kb keys  and unioned them, so
> > that seems tractable (took 1 minute, ~250MB ram)
> > * SynonymFilter code worked fine to associate output and record match
> token
> > length.
> > * Saw how FuzzySuggester created Levenshtein automata from query/lookup
> key
> > and intersected that with kb-like fst.
> >
> > I don't see how to create Levenshtein FSTs (vs automatons) associating
> > outputs with unedited form, and union'ing them together.
> >
> > Is this a bad idea?  Maybe better idea?
> >
> > Thanks in advance,
> > - Luke
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message