lucene-java-user mailing list archives

From David Spencer <>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Wed, 15 Sep 2004 16:51:55 GMT
Aad Nales wrote:

> By trying: if you type const you will find that it returns 216 hits. The
> third sports 'const' as a term (space separated and all). I would expect
> 'conts' to return with const as well. But again I might be mistaken. I
> am now trying to figure what the problem might be: 
> 1. my expectations (most likely ;-)
> 2. something in the code..

Good question.

If I use the form at the bottom of the page and ask for more results, 
the suggestion of "const" does eventually show up - 99th however(!).

Even boosting the prefix match from 2.0 to 10.0 only changes the ranking 
a few slots.

To restate the question:

The misspelled word is: "conts".
The expected suggestion is "const", which seems reasonable enough as 
it's just a transposition away, so the string distance is low.
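
How low the distance is depends on whether a transposition counts as a single edit. The sketch below is only an illustration, not code from NGramSpeller: the class name EditDistance and the allowTransposition flag are made up for this example. With plain Levenshtein the pair is 2 edits apart; with adjacent transpositions allowed (the "optimal string alignment" variant) it is 1.

```java
public class EditDistance {
    /**
     * Dynamic-programming edit distance between a and b.
     * With allowTransposition == false this is plain Levenshtein;
     * with true, swapping two adjacent characters costs one edit
     * (optimal string alignment variant).
     */
    static int distance(String a, String b, boolean allowTransposition) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution or match
                if (allowTransposition && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1); // adjacent swap
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("conts", "const", false)); // 2
        System.out.println(distance("conts", "const", true));  // 1
    }
}
```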

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...

conts -> con ont nts
const -> con ons nst

Thus they share just one 3gram, which is why "const" scores so low. This 
is an interesting issue: how to tune the algorithm so that it ranks 
words this close higher.
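
The 3gram comparison above can be reproduced with a few lines of Java. This is a minimal sketch, not the NGramSpeller implementation; the class and method names (NGramOverlap, ngrams, shared) are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NGramOverlap {
    /** Returns every contiguous substring of length n, in order. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Returns the ngrams of length n that both words have in common. */
    static Set<String> shared(String a, String b, int n) {
        Set<String> common = new LinkedHashSet<>(ngrams(a, n));
        common.retainAll(ngrams(b, n));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("conts", 3));          // [con, ont, nts]
        System.out.println(ngrams("const", 3));          // [con, ons, nst]
        System.out.println(shared("conts", "const", 3)); // [con]
    }
}
```

Only one of the three 3grams survives the transposition, so a score built on ngram overlap has little to work with for a five-letter word.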

I guess one way is to add all simple transpositions to the lookup table 
(the "ngram index") so that these could easily be found, with the 
heuristic that "a frequent way of misspelling words is to transpose two 
adjacent letters".
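
A rough sketch of how those transpositions could be generated follows. It only shows the string manipulation; how the variants would be stored in the ngram index is left open, and the names Transpositions and adjacentTranspositions are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class Transpositions {
    /** Returns every distinct word obtained by swapping two adjacent letters. */
    static Set<String> adjacentTranspositions(String word) {
        Set<String> variants = new LinkedHashSet<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] == chars[i + 1]) {
                continue; // swapping identical letters changes nothing
            }
            swap(chars, i, i + 1);
            variants.add(new String(chars));
            swap(chars, i, i + 1); // restore the original order
        }
        return variants;
    }

    private static void swap(char[] chars, int i, int j) {
        char tmp = chars[i];
        chars[i] = chars[j];
        chars[j] = tmp;
    }

    public static void main(String[] args) {
        // Variants that could be indexed alongside the correct word "const"
        System.out.println(adjacentTranspositions("const"));
        // [ocnst, cnost, cosnt, conts]
    }
}
```

If these variants were indexed for "const", the misspelling "conts" would match one of them exactly. The cost is roughly one extra entry per letter of each indexed word.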

Based on other mails I'll make some additions to the code and will 
report back if anything of interest changes here.

> -----Original Message-----
> From: Andrzej Bialecki [] 
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
> Aad Nales wrote:
>>Perhaps I misunderstand something so please correct me if I do. I used
>> to look for conts without 
>>changing any of the default values. What I got as results did not 
>>include 'const' which has quite a high frequency in your index and
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.
>>should have a pretty low levenshtein distance. Any idea what causes 
>>this behavior?
