lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Wed, 15 Sep 2004 19:57:00 GMT
Andrzej Bialecki wrote:

> David Spencer wrote:
>> To restate the question for a second.
>> The misspelled word is: "conts".
>> The sugggestion expected is "const", which seems reasonable enough as 
>> it's just a transposition away, thus the string distance is low.
>> But - I guess the problem w/ the algorithm is that for short words 
>> like this, with transpositions, the two words won't share many ngrams.
>> Just looking at 3grams...
>> conts -> con ont nts
>> const -> con ons nst
>> Thus they just share 1 3gram, thus this is why it scores so low. This 
>> is an interesting issue, how to tune the algorithm so that it might 
>> return words this close higher.
> If you added 2-grams, then it would look like this (constructing also 
> special start/end grams):

Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I reformed index w/ ngrams from length 2 to 5, 
and also store transpositions, and w/ appropriate boosts :) then "const" 
is returned 2nd.

> conts -> _c co on nt ts s_
> const -> _c co on ns st t_
> which gives 50% of overlap.
> In another system that I designed we were using a combination of 2-4 
> grams, albeit for a slightly different purpose, so in this case it would 
> be:
> conts:
>     _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
> const:
>     _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
> and the overlap is 40%.
>> I guess one way is to add all simple transpositions to the lookup 
>> table (the "ngram index") so that these could easily be found, with 
>> the heuristic that "a frequent way of misspelling words is to 
>> transpose two adjacent letters".
> Yes, sounds like a good idea. Even though it increases the size of the 
> lookup index, it still beats using the linear search...

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message