lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Updated: (LUCENE-2507) automaton spellchecker
Date Tue, 28 Sep 2010 17:45:35 GMT


Robert Muir updated LUCENE-2507:

    Attachment: LUCENE-2507.patch

We have sped up this seeking a lot recently, and I improved the patch some:
* avoid calling docFreq on the suggestions, by using the TermsEnum's docFreq
* Mike had the idea that we should actually try lower edit distances first. The
  general use case here is a small number of suggestions (e.g. 1), so
  we try edit distance=1 first, and only if this doesn't give enough suggestions
  do we then try higher distances.
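
For illustration only, here is a minimal sketch of the "lower edit distances first" idea in plain Java. It is not the code in the attached patch: the class IncrementalSuggest, its Suggestion type, and the in-memory term dictionary (term -> document frequency) are hypothetical stand-ins, and a brute-force Levenshtein scan replaces the automaton-driven term enumeration. Ranking by document frequency within one distance is also an assumption. The document frequency is read from the same entry that is being enumerated, which mirrors the point about not doing a separate docFreq lookup per suggestion.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IncrementalSuggest {

    /** Hypothetical result type: a term, its edit distance, and its document frequency. */
    static final class Suggestion {
        final String term;
        final int distance;
        final int docFreq;

        Suggestion(String term, int distance, int docFreq) {
            this.term = term;
            this.distance = distance;
            this.docFreq = docFreq;
        }

        @Override
        public String toString() {
            return term + " (distance=" + distance + ", docFreq=" + docFreq + ")";
        }
    }

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Tries distance 1 first and only moves to higher distances while fewer
     * than numSug suggestions have been collected.
     */
    static List<Suggestion> suggest(Map<String, Integer> termDict, String word,
                                    int numSug, int maxDistance) {
        List<Suggestion> result = new ArrayList<>();
        for (int d = 1; d <= maxDistance && result.size() < numSug; d++) {
            List<Suggestion> atThisDistance = new ArrayList<>();
            for (Map.Entry<String, Integer> e : termDict.entrySet()) {
                if (levenshtein(word, e.getKey()) == d) {
                    // docFreq comes from the entry being enumerated, no second lookup
                    atThisDistance.add(new Suggestion(e.getKey(), d, e.getValue()));
                }
            }
            // Assumption: within one distance, prefer more frequent terms.
            atThisDistance.sort(Comparator.comparingInt((Suggestion s) -> s.docFreq).reversed());
            for (Suggestion s : atThisDistance) {
                if (result.size() == numSug) {
                    break;
                }
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> termDict = new TreeMap<>();
        termDict.put("lucene", 10);
        termDict.put("lucent", 3);
        termDict.put("lucerne", 1);
        // With numSug=1 the loop stops after distance 1 and never scans distance 2.
        System.out.println(suggest(termDict, "lucenr", 1, 2));
    }
}
```

With numSug=1 the higher distances are never tried as long as any term sits at distance 1, which is the common case this ordering is meant to make cheap.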

I think this is a good approach here, because we are getting Levenshtein distances directly,
so we don't have the problem the n-gram based spellchecker has (its javadoc is quoted below for reference):

   * <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
   * is not the same as the edit distance strategy used to calculate the best
   * matching spell-checked word from the hits that Lucene found, one usually has
   * to retrieve a couple of numSug's in order to get the true best match.
   * <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
   * Thus, you should set this value to <b>at least</b> 5 for a good suggestion.

Since we are actually computing Levenshtein distance, you can safely use lower values for numSug,
such as numSug=1.

> automaton spellchecker
> ----------------------
>                 Key: LUCENE-2507
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: Robert Muir
>             Fix For: 4.0
>         Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch
> The current spellchecker makes an n-gram index of your terms, and queries this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an algorithm such as Levenshtein.
> Alternatively, we could just do a levenshtein query directly against the index, then we wouldn't need
> a separate index to rebuild.

