lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Re: Levenshtein FST's?
Date Thu, 26 May 2016 16:08:39 GMT
I should note, I know in I can
call Operations.determinize(union, 10_000_000) but union of 5000+
Levenshtein automata seems to require too many states to be tractable, and
that's on the low end of what I'd like to work with.

On Thu, May 26, 2016 at 9:59 AM, Luke Nezda <lnezda@gmail.com> wrote:

> I was surprised union of deterministic automata wasn't deterministic too,
> but it would appear so:
>
> public void testUnionOfDeterministicsIsDeterministic() {
>   Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1);
>   Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1);
>   assertTrue(mud.isDeterministic());
>   assertTrue(mad.isDeterministic());
>
>   Automaton union = Operations.union(mud, mad);
>   assertTrue(union.isDeterministic()); // fails
> }
>
>
> Maybe I have a bad embedded assumption? Or maybe there's an interesting
> opportunity to make union operation try harder to make a deterministic
> result?
>
> On Wed, May 25, 2016 at 3:07 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Hmm it's curious you found determinization to be so costly since the
>> Levenshtein automata are already determinized.
>>
>> But, yeah, converting THAT to an FST is tricky...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lnezda@gmail.com> wrote:
>>
>> > Oof, sounds too tricky for me to justify pursuing right now.  While
>> > union'ing 10k Levenshtein automata was tractable, seems determinizing
>> the
>> > result is not (NP-hard - oops :)), let alone working out a suitably
>> useful
>> > conversion to an FST.
>> >
>> > Thank you very much for input!
>> >
>> > Kind regards,
>> > - Luke
>> >
>> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> > > In theory if you could construct an FST from your union'd Levenshtein
>> > > automata in the right format for SynonymFilter then you could run it
>> > using
>> > > that FST, and it would give you the start/end tokens for each match,
>> and
>> > > you'd have the output for each match be the original (unedited) terms.
>> > >
>> > > But I think information was already lost when you did the initial
>> union,
>> > > i.e. which original term to output, on which arc (s) in the automaton.
>> > >
>> > > Also you'd have to at least determinize (and maybe you want to
>> minimize)
>> > > the union'd automata since FST cannot represent non-deterministic
>> > machines.
>> > >
>> > > Possibly you could determinize, reconstruct the lost information, by
>> > > walking the automaton for each original word, intersecting the
>> > Levenshtein
>> > > automaton for that word, and recording the first arcs you hit that has
>> > one
>> > > unique original word as its output, and placing outputs on those arcs,
>> > and
>> > > then doing a "rote" conversion to the syn filter's FST format.  This
>> part
>> > > sounds tricky :)
>> > >
>> > > Mike McCandless
>> > >
>> > > http://blog.mikemccandless.com
>> > >
>> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lnezda@gmail.com> wrote:
>> > >
>> > > > On Tuesday, May 24, 2016, Michael McCandless <
>> > lucene@mikemccandless.com
>> > > > <javascript:_e(%7B%7D,'cvml','lucene@mikemccandless.com');>>
wrote:
>> > > >
>> > > > > This sounds ambitious ;)
>> > > >
>> > > >
>> > > > That's why I was hoping for some advice from y'all :)
>> > > >
>> > > > >
>> > > > > Note that Lucene has Levenshtein automata (not FSTs).
>> > > >
>> > > >
>> > > > Right, but I thought it might not be too big a leap.
>> > > >
>> > > > >
>> > > > > Can't you just run an AutomatonQuery on your index once you have
>> > > unioned
>> > > > > all Levenshtein automata into a single one?
>> > > > >
>> > > > > Or is the reason that you want to convert to an FST so you can
>> > > associate
>> > > > > which unedited form(s) a given document matched?
>> > > >
>> > > >
>> > > > Correct, I want the unedited form(s) as well as the match character
>> > > offsets
>> > > > of each match in each document.
>> > > >
>> > > > >
>> > > > > Mike McCandless
>> > > > >
>> > > > > http://blog.mikemccandless.com
>> > > > >
>> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lnezda@gmail.com>
>> > wrote:
>> > > > >
>> > > > > > Hello, all -
>> > > > > >
>> > > > > > I'd like to use Lucene's automaton/FST code to achieve fast
>> fuzzy
>> > > (OSA
>> > > > > edit
>> > > > > > distance up to 2) search for many (10k+) strings (knowledge
>> base:
>> > kb)
>> > > > in
>> > > > > > many large strings (docs).
>> > > > > >
>> > > > > > Approach I was thinking of: create Levenshtein FST with
all
>> paths
>> > > > > > associated with unedited form for each kb key, union all
into
>> > single
>> > > > fst,
>> > > > > > search docs for matches in fst in style of SynonymFilter.
>> > > > > >
>> > > > > > * I created 10k Levenshtein automata from kb keys  and unioned
>> > them,
>> > > so
>> > > > > > that seems tractable (took 1 minute, ~250MB ram)
>> > > > > > * SynonymFilter code worked fine to associate output and
record
>> > match
>> > > > > token
>> > > > > > length.
>> > > > > > * Saw how FuzzySuggester created Levenshtein automata from
>> > > query/lookup
>> > > > > key
>> > > > > > and intersected that with kb-like fst.
>> > > > > >
>> > > > > > I don't see how to create Levenshtein FSTs (vs automatons)
>> > > associating
>> > > > > > outputs with unedited form, and union'ing them together.
>> > > > > >
>> > > > > > Is this a bad idea?  Maybe better idea?
>> > > > > >
>> > > > > > Thanks in advance,
>> > > > > > - Luke
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message