lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2507) automaton spellchecker
Date Fri, 01 Oct 2010 04:48:33 GMT


Robert Muir commented on LUCENE-2507:

bq. That is a very good idea yes, but I don't think its necessary to do that before this is

Here are some *very rough* numbers from that, against the FIRE English corpus (sorry,
I'm still downloading Wikipedia, it's quite large!).
Note that this is only relative, e.g. I don't even know if these terms all exist in that corpus.
Additionally, some contain punctuation etc.; I only lowercased them for consistency.

For reference, there are 547 incorrect/correct term pairs in this aspell spelling correction
My corpus has ~150,000 docs, with 304,000 unique terms in the body field.
For both spellcheckers I used all defaults, e.g. spellchecker.suggestSimilar(words[1].toLowerCase(),
1, reader, "body", true);

||impl||Number correct[1] (out of 547)||Number correct, inverted[2] (out of 547)||Avg time in ms[3]||

1. Using the misspelling as a query term, does the spellchecker return the correct spelling?
2. Using the correct spelling as a query term, does the spellchecker return nothing at all?
3. This is the average time to perform an actual correction; both spellcheckers have some
way to do no work at all for the common (correctly spelled) case.
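
As a rough illustration of how the two accuracy columns could be tallied, here is a minimal Java sketch. It is not the code used for these numbers: the class name, the `score` method, the pair layout (misspelling at index 0, correct spelling at index 1), and the `topSuggestion` function standing in for a spellchecker call are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.function.Function;

public class SpellBenchmarkSketch {
    /**
     * Tallies the two accuracy counts described in notes 1 and 2.
     *
     * pairs: each row is {misspelling, correctSpelling} (assumed layout).
     * topSuggestion: hypothetical stand-in for a spellchecker; returns the
     * top suggestion for a word, or null when it suggests nothing.
     *
     * Returns {numberCorrect, numberCorrectInverted}.
     */
    public static int[] score(String[][] pairs, Function<String, String> topSuggestion) {
        int correct = 0;
        int correctInverted = 0;
        for (String[] pair : pairs) {
            // Only lowercasing is applied, as in the description above.
            String wrong = pair[0].toLowerCase(Locale.ROOT);
            String right = pair[1].toLowerCase(Locale.ROOT);

            // Note 1: the misspelling should be corrected to the right word.
            if (right.equals(topSuggestion.apply(wrong))) {
                correct++;
            }
            // Note 2: the correct spelling should produce no suggestion at all.
            if (topSuggestion.apply(right) == null) {
                correctInverted++;
            }
        }
        return new int[] { correct, correctInverted };
    }
}
```

Timing (note 3) is left out of the sketch, since it depends on how each spellchecker short-circuits correctly spelled terms.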

So although the benchmark itself isn't meant for search engine benchmarking (e.g. it contains stopwords/punctuation),
this basically shows what I've been seeing: I think this spellchecker outperforms the
existing one, and the perf cost is reasonable.

> automaton spellchecker
> ----------------------
>                 Key: LUCENE-2507
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0
>         Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch
> The current spellchecker makes an n-gram index of your terms, and queries this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an algorithm such as Levenshtein.
> Alternatively, we could just do a Levenshtein query directly against the index, then we wouldn't need
> a separate index to rebuild.
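
To make the re-ranking step in the description concrete, here is a minimal Java sketch of plain Levenshtein distance used to order a candidate list. It is an illustration only, not Lucene's implementation: the class and method names are hypothetical, and the candidate list is assumed to come from some earlier lookup such as the n-gram query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LevenshteinRerankSketch {
    /** Edit distance counting insertions, deletions, and substitutions. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns a copy of candidates ordered by increasing distance to query. */
    public static List<String> rerank(String query, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((String c) -> distance(query, c)));
        return sorted;
    }
}
```

For example, `rerank("lucen", List.of("luck", "lucene"))` puts "lucene" (distance 1) ahead of "luck" (distance 2).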

