lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <>
Subject Re: frequent terms - Re: combining open office spellchecker with Lucene
Date Tue, 14 Sep 2004 22:00:41 GMT
Doug Cutting wrote:

> David Spencer wrote:
>> [1] The user enters a query like:
>>     recursize descent parser
>> [2] The search code parses this and sees that the 1st word is not a 
>> term in the index, but the next 2 are. So it ignores the last 2 terms 
>> ("recursive" and "descent") and suggests alternatives to 
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>> I guess you're saying that, if the user enters a term that appears in 
>> the index and thus is sort of spelled correctly ( as it exists in some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternativies to these words too (in addition 
>> to the words in the query that are not in the index at all).
> Almost.
> If the user enters "a recursize purser", then: "a", which is in, say, 
>  >50% of the documents, is probably spelled correctly and "recursize", 
> which is in zero documents, is probably mispelled.  But what about 
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".

OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.

> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

> Doug
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message