lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Juan Caicedo <juan.caic...@cavorite.com>
Subject Re: New feature idea - Backwards (FST) dictionary for approximate string search
Date Wed, 10 Jul 2019 14:40:19 GMT
I didn’t integrate it directly with a Lucene index. I used the FST class
(and related utilities) to build a stand-alone spellchecker.

However, the FST for the term dictionary is modified at index time only. At
query time, we need to create a variation of the Levemshtein automata and
match it with the FST accordingly.

I need to take a closer look at the Lucene code to see how can we integrate
it with the existing modules. I’ll do that this week.

On Tue, Jul 9, 2019 at 03:18 Michael McCandless <lucene@mikemccandless.com>
wrote:

> This sounds interesting!
>
> Did your patch build the FST at index time, so at search time all that was
> needed was to load the FST from the index directory?
>
> Is the FST per-segment?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo <juan.caicedo@cavorite.com>
> wrote:
>
>> Hi Michael,
>>
>> I guess that I should have added more details :-)
>>
>> The main benefit from the technique is to do approximate search in
>> large dictionaries. That is: find all the entries that are within 1 or
>> 2 edit steps from the query. It's more useful for longer (starting
>> from ~7 characters, iirc), but in general the technique can be applied
>> to queries of any length. The main requirement is that we build the
>> dictionary as an FST, using the backwards (reversed) keys of the
>> original dictionary.
>>
>> I initially used the technique to implement a stand-alone
>> spellchecker, but I think that it can also be used to optimize the
>> Lucene fuzzy queries (e.g. for the spelling/suggest module). However,
>> I'll need to look how can it be integrated with the part that creates
>> the dictionary.
>>
>> I'll take a look at the code this week and I'll try to publish it in a
>> public repository so that we can discuss about this with more concrete
>> details.
>>
>> I skimmed the paper trying to understand possible applications of the
>> technique. It sounds like efficient approximate (ie with some edits)
>> substring search is the main idea? I don't believe such a query exists
>> today in Lucene (nor any Suggester as far as I know). It sounds as if
>> this would be useful for searching within large strings, eg DNA
>> sequences or something like that, and maybe less applicable to typical
>> "full text" (ie tokenized) search where the strings being searched are
>> relatively shorter - does that sound right?
>>
>> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >
>> > Juan, that sounds intriguing.
>> >
>> > I skimmed the paper trying to understand possible applications of the
>> > technique. It sounds like efficient approximate (ie with some edits)
>> > substring search is the main idea? I don't believe such a query exists
>> > today in Lucene (nor any Suggester as far as I know). It sounds as if
>> > this would be useful for searching within large strings, eg DNA
>> > sequences or something like that, and maybe less applicable to typical
>> > "full text" (ie tokenized) search where the strings being searched are
>> > relatively shorter - does that sound right?
>> >
>> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <juan.caicedo@cavorite.com>
>> wrote:
>> > >
>> > > Hello,
>> > >
>> > > I've been working on a project for extending LevenshteinAutomata and
>> > > I'd like to know if it would be useful to add it to Lucene.
>> > >
>> > > I've implemented the 'backwards dictionary' technique (see [1],
>> > > section 6) for speeding up approximate search. This technique allows
>> > > us to narrow down the search and, therefore, reduce the running time
>> > > (at the expense of using more memory).
>> > >
>> > > I implemented it quite some time ago using an older version of Lucene,
>> > > so I need to revisit the code. However, the implementation was
>> > > relatively simple and it didn't require major changes to the core
>> > > classes. I can share the code in a public repository and iterate on
>> > > it, while I make it compatible for new Lucene APIs, add benchmarks,
>> > > and more unit tests.
>> > >
>> > > Ideally, I'd like to contribute to Lucene, either as part of core,
>> > > suggest or a different module.
>> > >
>> > > What do you think?
>> > >
>> > > [1]
>> https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> > >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Mime
View raw message