lucene-java-user mailing list archives

From Marc Hadfield
Subject Re: Funny results with Fuzzy
Date Tue, 25 Oct 2005 17:43:47 GMT

hello -

a question related to fuzzy queries:

have there been any implementations of "fuzzy" queries other than 
edit distance?  and/or modifications of edit distance that penalize 
common alternate spellings less? - e.g. "couldn't" vs. "couldnt" -- here the 
apostrophe would get a smaller penalty than a character mismatch.
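
to make the idea concrete, here is a rough sketch of an edit distance in which inserting or deleting an apostrophe or hyphen costs less than any other edit. the class name and the cost values (1.0 and 0.25) are hypothetical, picked only for illustration; this is not code from Lucene or SecondString.

```java
final class WeightedEditDistance {
    // Hypothetical cost values, chosen only for illustration.
    static final float FULL_COST = 1.0f;
    static final float PUNCT_COST = 0.25f;

    // Inserting or deleting an apostrophe or hyphen is cheaper than other edits.
    static float indelCost(char c) {
        return (c == '\'' || c == '-') ? PUNCT_COST : FULL_COST;
    }

    // Two-row dynamic-programming edit distance with per-character indel costs.
    static float distance(String a, String b) {
        float[] prev = new float[b.length() + 1];
        float[] curr = new float[b.length() + 1];
        // First row: build a prefix of b from the empty string by insertions.
        for (int j = 1; j <= b.length(); j++) {
            prev[j] = prev[j - 1] + indelCost(b.charAt(j - 1));
        }
        for (int i = 1; i <= a.length(); i++) {
            char ca = a.charAt(i - 1);
            curr[0] = prev[0] + indelCost(ca);
            for (int j = 1; j <= b.length(); j++) {
                char cb = b.charAt(j - 1);
                float substitute = prev[j - 1] + (ca == cb ? 0f : FULL_COST);
                float delete = prev[j] + indelCost(ca);
                float insert = curr[j - 1] + indelCost(cb);
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            float[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

with these costs, "couldn't" vs. "couldnt" comes out at 0.25, while "couldnt" vs. "couldst" (a real character mismatch) comes out at 1.0.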

i'm thinking specifically of the algorithms in the SecondString open 
source package:

how difficult do you think it would be to wrap an alternate algorithm 
that provides a:
float score(String1, String2)

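one way the wrapping could look, as a sketch only: a small interface exposing the score(String1, String2) signature above, so that different similarity algorithms can be swapped in behind it. the names StringScorer and NormalizedLevenshteinScorer are hypothetical, and the plain Levenshtein implementation is just a stand-in for whatever algorithm would actually be plugged in; how such a scorer would be hooked into Lucene's query classes is not shown here.

```java
// Hypothetical interface: any similarity algorithm can sit behind it.
interface StringScorer {
    // Returns a similarity in [0, 1], where 1 means the strings are identical.
    float score(String s1, String s2);
}

// Stand-in implementation: plain Levenshtein distance, normalized by the
// length of the longer string.
final class NormalizedLevenshteinScorer implements StringScorer {
    @Override
    public float score(String s1, String s2) {
        int maxLen = Math.max(s1.length(), s2.length());
        if (maxLen == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        return 1.0f - (float) levenshtein(s1, s2) / maxLen;
    }

    // Classic two-row Levenshtein distance with unit costs.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

a different algorithm would then only need to implement StringScorer; callers that depend on the interface would not have to change.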

mark harwood wrote:

>>One thing I was thinking of doing was checking the
>>character frequency 
>An alternative idea is index-time fuzzification rather
>than query-time. This is documented in one of the case
>studies in LIA - the principle is you don't
>index/search for whole words but use an NGram Analyzer
>to break them up at index time:
>Kylie becomes multiple words:
>[ k]
>[ ky]
>[ kyl]
>[ kylie ]
>Obviously you use the same analyzer to process
>Lucene will automatically look after relevancy of
>partial matches for you but your indexes are bigger
>and your queries will generate many more Boolean

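the index-time approach quoted above can be illustrated with a small sketch that breaks a word into its leading prefixes, in the spirit of the "Kylie" example. this is plain Java, not a Lucene Analyzer; the class name, the method name and the minLen parameter are assumptions made for illustration, and a real setup would emit these strings as tokens from an analyzer used at both index and query time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixGrams {
    // Returns the lowercased prefixes of word, from minLen characters up to
    // the whole word. Words shorter than minLen produce an empty list.
    static List<String> prefixGrams(String word, int minLen) {
        String w = word.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int len = Math.max(1, minLen); len <= w.length(); len++) {
            grams.add(w.substring(0, len));
        }
        return grams;
    }
}
```

calling PrefixGrams.prefixGrams("Kylie", 1) gives k, ky, kyl, kyli, kylie. each of these would be indexed as a separate term, which is why the index grows and why a query run through the same step expands into many terms.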