lucene-java-user mailing list archives

From David Spencer
Subject Re: frequent terms - Re: combining open office spellchecker with Lucene
Date Wed, 15 Sep 2004 20:00:54 GMT
Doug Cutting wrote:

> David Spencer wrote:
>> [1] The user enters a query like:
>>     recursize descent parser
>> [2] The search code parses this and sees that the 1st word is not a 
>> term in the index, but the next 2 are. So it ignores the last 2 terms 
>> ("descent" and "parser") and suggests alternatives to 
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency, it is left as-is.
>> I guess you're saying that, if the user enters a term that appears in 
>> the index and thus is sort of spelled correctly ( as it exists in some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternatives to these words too (in addition 
>> to the words in the query that are not in the index at all).
> Almost.
> If the user enters "a recursize purser", then: "a", which is in, say, 
> more than 50% of the documents, is probably spelled correctly and 
> "recursize", which is in zero documents, is probably misspelled.  But 
> what about 
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".
> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.
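
A minimal sketch of how such a popularity filter could work, assuming the document frequency of each term is already available in a plain map. This is not the code discussed in this message: the class name PopularityFilter, the method filter, and the docFreq map are hypothetical and stand in for whatever lookup the index provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only, not the code from this thread.
final class PopularityFilter {

    /**
     * Filters spelling candidates by document frequency.
     *
     * @param word        the query term being checked
     * @param candidates  alternatives produced by some earlier spell-check step
     * @param docFreq     assumed lookup: term -> number of documents containing it
     * @param morePopular if true, keep only candidates that occur in more
     *                    documents than the original word
     */
    static List<String> filter(String word, List<String> candidates,
                               Map<String, Integer> docFreq, boolean morePopular) {
        int wordFreq = docFreq.getOrDefault(word, 0); // 0 if the word is not indexed
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate.equals(word)) {
                continue; // never suggest the word itself
            }
            int candidateFreq = docFreq.getOrDefault(candidate, 0);
            if (!morePopular || candidateFreq > wordFreq) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up frequencies, chosen only to exercise both branches.
        Map<String, Integer> docFreq = Map.of(
                "purser", 10, "parser", 50, "remove", 900, "remote", 300);

        System.out.println(filter("purser", List.of("parser"), docFreq, true));  // [parser]
        System.out.println(filter("remove", List.of("remote"), docFreq, true));  // []
        System.out.println(filter("remove", List.of("remote"), docFreq, false)); // [remote]
    }
}
```

With the filter on, a common word such as "remove" yields no suggestions because none of its neighbours is more frequent; with it off, every candidate is passed through.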

If true (default) then for common words like "remove", no results are 
returned now, as expected:

But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
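
For illustration, a small sketch of generating simple transpositions, assuming that term means every variant of a word obtained by swapping two adjacent characters. How and where the variants are stored is not shown here, and the class name Transpositions and method adjacentSwaps are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only, not the code from this thread.
final class Transpositions {

    /** Returns every distinct string formed by swapping one pair of adjacent characters. */
    static Set<String> adjacentSwaps(String word) {
        Set<String> variants = new LinkedHashSet<>(); // keeps insertion order, drops duplicates
        char[] chars = word.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] == chars[i + 1]) {
                continue; // swapping equal characters would reproduce the word
            }
            char[] copy = chars.clone();
            copy[i] = chars[i + 1];
            copy[i + 1] = chars[i];
            variants.add(new String(copy));
        }
        return variants;
    }

    public static void main(String[] args) {
        System.out.println(adjacentSwaps("form")); // [ofrm, from, fomr]
    }
}
```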

-- Dave

> Doug
