lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Levenshtein FST's?
Date Tue, 24 May 2016 10:24:51 GMT
This sounds ambitious ;)

Note that Lucene has Levenshtein automata (not FSTs).

Can't you just run an AutomatonQuery on your index once you have unioned
all Levenshtein automata into a single one?

Or is the reason that you want to convert to an FST so you can associate
which unedited form(s) a given document matched?

Mike McCandless

http://blog.mikemccandless.com

On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com> wrote:

> Hello, all -
>
> I'd like to use Lucene's automaton/FST code to achieve fast fuzzy (OSA edit
> distance up to 2) search for many (10k+) strings (knowledge base: kb) in
> many large strings (docs).
>
> Approach I was thinking of: create Levenshtein FST with all paths
> associated with unedited form for each kb key, union all into single fst,
> search docs for matches in fst in style of SynonymFilter.
>
> * I created 10k Levenshtein automata from kb keys  and unioned them, so
> that seems tractable (took 1 minute, ~250MB ram)
> * SynonymFilter code worked fine to associate output and record match token
> length.
> * Saw how FuzzySuggester created Levenshtein automata from query/lookup key
> and intersected that with kb-like fst.
>
> I don't see how to create Levenshtein FSTs (vs automatons) associating
> outputs with unedited form, and union'ing them together.
>
> Is this a bad idea?  Maybe better idea?
>
> Thanks in advance,
> - Luke
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message