lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Re: Levenshtein FST's?
Date Fri, 27 May 2016 21:02:52 GMT
Point taken, but I wonder if there's an algorithmic shortcut to determinize
the union of Levenshtein DFAs...

Here's some stats on unioning n Levenshtein automata (2-edit, with
transpositions, of names) together (not attempting to determinize them):
* n=1000 numStates: 337,553 numTransitions: 1,577,237 (Xmx1g OK)
* n=5000 numStates: 1,620,944 numTransitions: 7,669,610 (Xmx1g fails, Xmx2g
OK)
* n=10000 numStates: 3,248,802 numTransitions: 15,377,814 (Xmx2g fails,
Xmx3g OK)

On Thu, May 26, 2016 at 12:13 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> But how many states does the not-yet-determinized union of 5000+
> Levenshtein automata contain?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, May 26, 2016 at 12:08 PM, Luke Nezda <lnezda@gmail.com> wrote:
>
> > I should note, I know in I can
> > call Operations.determinize(union, 10_000_000) but union of 5000+
> > Levenshtein automata seems to require too many states to be tractable,
> and
> > that's on the low end of what I'd like to work with.
> >
> > On Thu, May 26, 2016 at 9:59 AM, Luke Nezda <lnezda@gmail.com> wrote:
> >
> > > I was surprised union of deterministic automata wasn't deterministic
> too,
> > > but it would appear so:
> > >
> > > public void testUnionOfDeterministicsIsDeterministic() {
> > >   Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1);
> > >   Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1);
> > >   assertTrue(mud.isDeterministic());
> > >   assertTrue(mad.isDeterministic());
> > >
> > >   Automaton union = Operations.union(mud, mad);
> > >   assertTrue(union.isDeterministic()); // fails
> > > }
> > >
> > >
> > > Maybe I have a bad embedded assumption? Or maybe there's an interesting
> > > opportunity to make union operation try harder to make a deterministic
> > > result?
> > >
> > > On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
> > > lucene@mikemccandless.com> wrote:
> > >
> > >> Hmm it's curious you found determinization to be so costly since the
> > >> Levenshtein automata are already determinized.
> > >>
> > >> But, yeah, converting THAT to an FST is tricky...
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com> wrote:
> > >>
> > >> > Oof, sounds too tricky for me to justify pursuing right now.  While
> > >> > union'ing 10k Levenshtein automata was tractable, seems
> determinizing
> > >> the
> > >> > result is not (NP-hard - oops :)), let alone working out a suitably
> > >> useful
> > >> > conversion to an FST.
> > >> >
> > >> > Thank you very much for input!
> > >> >
> > >> > Kind regards,
> > >> > - Luke
> > >> >
> > >> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
> > >> > lucene@mikemccandless.com> wrote:
> > >> >
> > >> > > In theory if you could construct an FST from your union'd
> > Levenshtein
> > >> > > automata in the right format for SynonymFilter then you could
run
> it
> > >> > using
> > >> > > that FST, and it would give you the start/end tokens for each
> match,
> > >> and
> > >> > > you'd have the output for each match be the original (unedited)
> > terms.
> > >> > >
> > >> > > But I think information was already lost when you did the initial
> > >> union,
> > >> > > i.e. which original term to output, on which arc (s) in the
> > automaton.
> > >> > >
> > >> > > Also you'd have to at least determinize (and maybe you want to
> > >> minimize)
> > >> > > the union'd automata since FST cannot represent non-deterministic
> > >> > machines.
> > >> > >
> > >> > > Possibly you could determinize, reconstruct the lost information,
> by
> > >> > > walking the automaton for each original word, intersecting the
> > >> > Levenshtein
> > >> > > automaton for that word, and recording the first arcs you hit
that
> > has
> > >> > one
> > >> > > unique original word as its output, and placing outputs on those
> > arcs,
> > >> > and
> > >> > > then doing a "rote" conversion to the syn filter's FST format.
> This
> > >> part
> > >> > > sounds tricky :)
> > >> > >
> > >> > > Mike McCandless
> > >> > >
> > >> > > http://blog.mikemccandless.com
> > >> > >
> > >> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com>
> > wrote:
> > >> > >
> > >> > > > On Tuesday, May 24, 2016, Michael McCandless <
> > >> > lucene@mikemccandless.com
> > >> > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
> > wrote:
> > >> > > >
> > >> > > > > This sounds ambitious ;)
> > >> > > >
> > >> > > >
> > >> > > > That's why I was hoping for some advice from y'all :)
> > >> > > >
> > >> > > > >
> > >> > > > > Note that Lucene has Levenshtein automata (not FSTs).
> > >> > > >
> > >> > > >
> > >> > > > Right, but I thought it might not be too big a leap.
> > >> > > >
> > >> > > > >
> > >> > > > > Can't you just run an AutomatonQuery on your index
once you
> have
> > >> > > unioned
> > >> > > > > all Levenshtein automata into a single one?
> > >> > > > >
> > >> > > > > Or is the reason that you want to convert to an FST
so you can
> > >> > > associate
> > >> > > > > which unedited form(s) a given document matched?
> > >> > > >
> > >> > > >
> > >> > > > Correct, I want the unedited form(s) as well as the match
> > character
> > >> > > offsets
> > >> > > > of each match in each document.
> > >> > > >
> > >> > > > >
> > >> > > > > Mike McCandless
> > >> > > > >
> > >> > > > > http://blog.mikemccandless.com
> > >> > > > >
> > >> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com
> >
> > >> > wrote:
> > >> > > > >
> > >> > > > > > Hello, all -
> > >> > > > > >
> > >> > > > > > I'd like to use Lucene's automaton/FST code to
achieve fast
> > >> fuzzy
> > >> > > (OSA
> > >> > > > > edit
> > >> > > > > > distance up to 2) search for many (10k+) strings
(knowledge
> > >> base:
> > >> > kb)
> > >> > > > in
> > >> > > > > > many large strings (docs).
> > >> > > > > >
> > >> > > > > > Approach I was thinking of: create Levenshtein
FST with all
> > >> paths
> > >> > > > > > associated with unedited form for each kb key,
union all
> into
> > >> > single
> > >> > > > fst,
> > >> > > > > > search docs for matches in fst in style of SynonymFilter.
> > >> > > > > >
> > >> > > > > > * I created 10k Levenshtein automata from kb keys
 and
> unioned
> > >> > them,
> > >> > > so
> > >> > > > > > that seems tractable (took 1 minute, ~250MB ram)
> > >> > > > > > * SynonymFilter code worked fine to associate
output and
> > record
> > >> > match
> > >> > > > > token
> > >> > > > > > length.
> > >> > > > > > * Saw how FuzzySuggester created Levenshtein automata
from
> > >> > > query/lookup
> > >> > > > > key
> > >> > > > > > and intersected that with kb-like fst.
> > >> > > > > >
> > >> > > > > > I don't see how to create Levenshtein FSTs (vs
automatons)
> > >> > > associating
> > >> > > > > > outputs with unedited form, and union'ing them
together.
> > >> > > > > >
> > >> > > > > > Is this a bad idea?  Maybe better idea?
> > >> > > > > >
> > >> > > > > > Thanks in advance,
> > >> > > > > > - Luke
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message