lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <>
Subject [jira] Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
Date Fri, 04 Aug 2006 06:01:15 GMT

Karl Wettin updated LUCENE-626:

    Attachment: spellcheck_20060804.tar.gz

beta 3

total rewrite with focus on adaptation.

session search sequence extraction, training and suggesting are now separate classes passed
to the spell checker.

still requires lots of user interaction to build a sufficient dictionary.

has no optimization. bootstrap has been removed and will probably re-appear in a future default
suggestion scheme instead. should be fast enough.

now also comes with some junit test cases.

default implementations are quite simple, but effective: they strip punctuation and whitespace
from suggestive data (trained suggestive- and test phrases) in order to find incorrect composite and
decomposed words. e.g. "the davinci code" --> "the da vinci code", "a clock work orange"
--> "a clockwork orange".
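a minimal sketch of the stripping idea, not the code in the attachment: the class name StrippedPhraseDictionary and its methods are hypothetical. trained phrases are indexed under a key with everything but letters and digits removed, so a query that differs only in spacing or punctuation lands on the same key.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of the attached spell checker. */
final class StrippedPhraseDictionary {
    // stripped key -> trained (assumed correct) phrase
    private final Map<String, String> trainedByKey = new HashMap<>();

    /** Lower-cases and drops everything that is not a letter or digit. */
    static String key(String phrase) {
        return phrase.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    /** Registers a phrase that is assumed to be spelled correctly. */
    void train(String goodPhrase) {
        trainedByKey.put(key(goodPhrase), goodPhrase);
    }

    /**
     * Returns the trained phrase sharing the query's stripped key,
     * or null if there is none or the query already equals it.
     */
    String suggest(String query) {
        String trained = trainedByKey.get(key(query));
        return (trained == null || trained.equals(query)) ? null : trained;
    }
}
```

after train("the da vinci code"), suggest("the davinci code") returns the trained phrase, since both strip to the same key. a real implementation would have to handle several trained phrases colliding on one key.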

beta 4 will focus on training- and suggestion classes that work on a secondary trie populated
with known good data extracted from the corpus, navigated with edit distance. perhaps a forest-type
trie to allow any starting point in a phrase.
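a rough sketch of what navigating a trie with edit distance could look like. this is a common textbook technique (one Levenshtein row per trie node, pruning branches that can no longer match), not the planned beta 4 code; the class EditDistanceTrie is hypothetical, and it only matches whole phrases from their first character, so it does not cover the forest-type idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration: a character trie searched with a Levenshtein bound. */
final class EditDistanceTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        String phrase; // non-null when a known good phrase ends at this node
    }

    private final Node root = new Node();

    /** Adds a known good phrase, one character per trie level. */
    void add(String phrase) {
        Node node = root;
        for (char c : phrase.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.phrase = phrase;
    }

    /** Returns all stored phrases within maxDistance edits of the query. */
    List<String> search(String query, int maxDistance) {
        // Row for the empty prefix: i insertions to reach query[0..i).
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, maxDistance, hits);
        }
        return hits;
    }

    private static void walk(Node node, char c, String query, int[] prev,
                             int maxDistance, List<String> hits) {
        // Compute the Levenshtein row for the trie prefix ending in c.
        int[] row = new int[prev.length];
        row[0] = prev[0] + 1;
        int best = row[0];
        for (int i = 1; i < row.length; i++) {
            int substitute = prev[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
            row[i] = Math.min(substitute, Math.min(row[i - 1] + 1, prev[i] + 1));
            best = Math.min(best, row[i]);
        }
        if (node.phrase != null && row[row.length - 1] <= maxDistance) {
            hits.add(node.phrase);
        }
        // Row minima never decrease with depth, so this branch can be cut.
        if (best <= maxDistance) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), query, row, maxDistance, hits);
            }
        }
    }
}
```

the pruning step is what makes the trie worthwhile: a shared prefix that is already too far from the query is rejected once for every phrase below it.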


beta 4 will focus on discriminating trained queries to build clusters and suggest (facet) classes
parallel to a plain text suggestion. that would be a major ram-consumer and require lots
of manual tweaking per implementation, but a cool enough feature.

time will tell.

> Adaptive, user query session analyzing spell checker.
> -----------------------------------------------------
>                 Key: LUCENE-626
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: spellcheck_0.0.1.tar.gz, spellcheck_20060725.tar.gz, spellcheck_20060804.tar.gz
> From javadocs:
>  This is an adaptive, user query session analyzing spell checker. In plain words, a word
and phrase dictionary that will learn from how users act while searching.
> Be aware, this is a beta version. It is not finished, but yields great results if you
have enough user activity, RAM and a fairly narrow document corpus. The RAM problem can be
fixed if you implement your own subclass of SpellChecker as the abstract methods of this class
are the CRUD methods. This will most probably change to a strategy class in a future version.
> 1. Gram up results to detect composite words that should not be composite words, and vice
versa.
> 2. Train a grammed token (markov) chain with output from an expectation maximization algorithm
(weka clusters?) parallel to a closest path (A* or breadth first?) to allow contextual suggestions
on queries that were never placed.
> Usage:
> Training
> At user query time, create an instance of QueryResults containing the query string, number
of hits and a time stamp. Add it to a chronologically ordered list in the user session (LinkedList
makes sense) that you pass on to train(sessionQueries) as the session times out.
> You also want to call the bootstrap() method every 100000 queries or so.
> Spell checking
> Call getSuggestions(query) and look at the results. Don't modify them! This method call
will be hidden in a facade in a future version.
> Note that the spell checker is case sensitive, so you want to clean up the query the same
way when you train as when you request the suggestions.
> I recommend something like query = query.toLowerCase().replaceAll(" +", " ").trim()
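
the usage notes quoted above can be sketched roughly as follows. only the names QueryResults and train(sessionQueries) come from the javadocs; the fields, the SessionTrainer interface, the UserSession class and the exact cleanup regex are assumptions made for illustration, and the real classes in the attachment may look different.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;

/** Stand-in for the QueryResults class named in the javadocs; fields are assumed. */
final class QueryResults {
    final String query;
    final int hits;
    final long timestampMillis;

    QueryResults(String query, int hits, long timestampMillis) {
        this.query = query;
        this.hits = hits;
        this.timestampMillis = timestampMillis;
    }
}

/** Assumed shape of the training entry point, i.e. train(sessionQueries). */
interface SessionTrainer {
    void train(List<QueryResults> sessionQueries);
}

/** Hypothetical per-user session that collects queries in chronological order. */
final class UserSession {
    private final LinkedList<QueryResults> queries = new LinkedList<>();

    /** One cleanup routine, to be used for training and for suggestion requests alike. */
    static String clean(String query) {
        return query.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    /** Call at user query time, after the search has returned its hit count. */
    void record(String query, int hits, long timestampMillis) {
        queries.add(new QueryResults(clean(query), hits, timestampMillis));
    }

    /** Call when the session times out; hands over a copy and resets the session. */
    void timeOut(SessionTrainer trainer) {
        if (queries.isEmpty()) {
            return;
        }
        trainer.train(new ArrayList<>(queries));
        queries.clear();
    }
}
```

when asking for suggestions, the same UserSession.clean(query) would be applied before calling getSuggestions, so that trained and requested strings are comparable.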

This message is automatically generated by JIRA.