lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Wed, 15 Sep 2004 18:29:50 GMT
David Spencer wrote:
> To restate the question for a second.
> The misspelled word is: "conts".
> The sugggestion expected is "const", which seems reasonable enough as 
> it's just a transposition away, thus the string distance is low.
> But - I guess the problem w/ the algorithm is that for short words like 
> this, with transpositions, the two words won't share many ngrams.
> Just looking at 3grams...
> conts -> con ont nts
> const -> con ons nst
> Thus they just share 1 3gram, thus this is why it scores so low. This is 
> an interesting issue, how to tune the algorithm so that it might return 
> words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):

conts -> _c co on nt ts s_
const -> _c co on ns st t_

which gives 50% of overlap.

In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would be:

	_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_

	_c co on ns st t_, _co con ons nst st_, _con cons onst nst_

and the overlap is 40%.

> I guess one way is to add all simple transpositions to the lookup table 
> (the "ngram index") so that these could easily be found, with the 
> heuristic that "a frequent way of misspelling words is to transpose two 
> adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message