lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Juan Caicedo <juan.caic...@cavorite.com>
Subject Re: New feature idea - Backwards (FST) dictionary for approximate string search
Date Tue, 09 Jul 2019 05:43:11 GMT
Hi Michael,

I guess that I should have added more details :-)

The main benefit from the technique is to do approximate search in
large dictionaries. That is: find all the entries that are within 1 or
2 edit steps from the query. It's more useful for longer (starting
from ~7 characters, iirc), but in general the technique can be applied
to queries of any length. The main requirement is that we build the
dictionary as an FST, using the backwards (reversed) keys of the
original dictionary.

I initially used the technique to implement a stand-alone
spellchecker, but I think that it can also be used to optimize the
Lucene fuzzy queries (e.g. for the spelling/suggest module). However,
I'll need to look how can it be integrated with the part that creates
the dictionary.

I'll take a look at the code this week and I'll try to publish it in a
public repository so that we can discuss about this with more concrete
details.

I skimmed the paper trying to understand possible applications of the
technique. It sounds like efficient approximate (ie with some edits)
substring search is the main idea? I don't believe such a query exists
today in Lucene (nor any Suggester as far as I know). It sounds as if
this would be useful for searching within large strings, eg DNA
sequences or something like that, and maybe less applicable to typical
"full text" (ie tokenized) search where the strings being searched are
relatively shorter - does that sound right?

On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <msokolov@gmail.com> wrote:
>
> Juan, that sounds intriguing.
>
> I skimmed the paper trying to understand possible applications of the
> technique. It sounds like efficient approximate (ie with some edits)
> substring search is the main idea? I don't believe such a query exists
> today in Lucene (nor any Suggester as far as I know). It sounds as if
> this would be useful for searching within large strings, eg DNA
> sequences or something like that, and maybe less applicable to typical
> "full text" (ie tokenized) search where the strings being searched are
> relatively shorter - does that sound right?
>
> On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <juan.caicedo@cavorite.com> wrote:
> >
> > Hello,
> >
> > I've been working on a project for extending LevenshteinAutomata and
> > I'd like to know if it would be useful to add it to Lucene.
> >
> > I've implemented the 'backwards dictionary' technique (see [1],
> > section 6) for speeding up approximate search. This technique allows
> > us to narrow down the search and, therefore, reduce the running time
> > (at the expense of using more memory).
> >
> > I implemented it quite some time ago using an older version of Lucene,
> > so I need to revisit the code. However, the implementation was
> > relatively simple and it didn't require major changes to the core
> > classes. I can share the code in a public repository and iterate on
> > it, while I make it compatible for new Lucene APIs, add benchmarks,
> > and more unit tests.
> >
> > Ideally, I'd like to contribute to Lucene, either as part of core,
> > suggest or a different module.
> >
> > What do you think?
> >
> > [1] https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message