lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Applying SpellChecker to a phrase
Date Mon, 03 Dec 2007 13:12:12 GMT
Have you actually tried this and called query.toString() to see
how the query is really expanded? I'm not all that familiar
with SpellChecker, but rather than presuming how things work,
you would get answers faster by running a test.

And why do you care about performance? I know that sounds like
a silly question, but you haven't supplied any parameters
about your index and usage to give us a clue whether this
matters. If your index is 3 MB, you'll never see the difference
between the two ways of expanding the query. If your
index is distributed over 10 machines and is 1 TB, you really,
really, really care.

And in any case, you can always generate your own query
of the second form with a bit of pre-processing.
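
As a rough illustration of that pre-processing, here is a minimal sketch in
plain Java with no Lucene dependency. The class name PhraseExpander, the
methods grouped() and expanded(), and the hand-built variants map are all
hypothetical; in practice the alternatives would come from whatever
SpellChecker suggests for each word. Note that the grouped string only
displays the structure: as far as I know the classic QueryParser does not
accept OR groups inside a quoted phrase, so for a real query the per-position
alternatives would probably be fed to something like MultiPhraseQuery (check
this against your Lucene version).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PhraseExpander {

    // Second form: keep one slot per word and group the alternatives in place.
    static String grouped(String phrase, Map<String, List<String>> variants) {
        StringBuilder sb = new StringBuilder();
        for (String word : phrase.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            List<String> alts = variants.get(word);
            if (alts == null || alts.isEmpty()) {
                sb.append(word);
            } else {
                sb.append('(').append(word);
                for (String alt : alts) {
                    sb.append(" OR ").append(alt);
                }
                sb.append(')');
            }
        }
        return sb.toString();
    }

    // First form: spell out every combination as a separate full phrase.
    static List<String> expanded(String phrase, Map<String, List<String>> variants) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (String word : phrase.trim().split("\\s+")) {
            List<String> choices = new ArrayList<>();
            choices.add(word);
            List<String> alts = variants.get(word);
            choices.addAll(alts == null ? Collections.<String>emptyList() : alts);
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String choice : choices) {
                    next.add(prefix.isEmpty() ? choice : prefix + " " + choice);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical suggestions; a real run would ask the spell checker.
        Map<String, List<String>> variants = new LinkedHashMap<>();
        variants.put("fraud", Arrays.asList("fruad"));
        variants.put("impostor", Arrays.asList("imposter"));

        String phrase = "The login fraud involves an impostor";
        System.out.println(grouped(phrase, variants));
        for (String p : expanded(phrase, variants)) {
            System.out.println("\"" + p + "\"");
        }
    }
}
```

The number of full phrases in the first form is the product of the number of
choices at each position (2 x 2 = 4 here), while the grouped form grows only
with the sum of the alternatives, which is why the difference starts to
matter for longer phrases or more suggestions per word.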

More info please.....


On Dec 2, 2007 10:14 PM, smokey <> wrote:

> Suppose I have an index containing the terms impostor, imposter, fraud,
> and fruad. Then presumably, regardless of whether I spell impostor and
> fraud correctly, Lucene SpellChecker will offer the improperly spelled
> versions as corrections. This means that the phrase "The login fraud
> involves an impostor" would need to expand to:
> "The login fraud involves an impostor" OR "The login fruad involves an
> impostor" OR "The login fraud involves an imposter" OR "The login fruad
> involves an imposter" to cover all cases and thus find all possible
> matches.
> However, that feels like an awful lot of matches to perform on the
> index.
> A more efficient approach would be to expand the query to "The login
> (fraud OR fruad) involves an (impostor OR imposter)", which should be
> logically equivalent to the first (longer) query.
> So my questions are:
> (1) whether others have generated queries of the form "The login (fraud
> OR fruad) involves an (impostor OR imposter)" when applying SpellChecker
> to a phrase, and agree that this indeed performs better than the first
> form;
> (2) whether others have observed any problems in doing so, in terms of
> performance or anything else.
> Any information would be appreciated.
