lucene-dev mailing list archives

From Joaquin Delgado <>
Subject Re: Reordering search results
Date Mon, 03 Oct 2005 18:05:32 GMT
Chris, you may want to consider a modified version of the Nutch analysis
code, which has a very slick treatment of stop words. Please refer to
chapter 4, page 145 of Lucene in Action, written by Erik and Otis, for
some details about the Nutch implementation.

-- J.D.

Erik Hatcher wrote:

> On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>>> 1- Words in a Document that are closer to the original search terms
>>> have a larger Score. For example, if I was searching for "wellcome",
>>> Document("wellcome") must be better than Document("welcome")
>> I'm just "thinking out loud" here, but some ideas that come to mind
>> are:  Index both the original text (with spelling errors), and the
>> spelling-corrected text.  When you search, search on the corrected
>> text, and add a non-required query clause that searches on the
>> uncorrected text, maybe boosted down a bit.  This way, if the spelling
>> was correct, it will match both the original term and the corrected
>> term (since they're the same), but a document with a misspelling would
>> match only the corrected term.  You'll have to experiment with boosts
>> and relevance/rankings here.
>> Another idea is, if you know the number of misspellings made at
>> indexing time (it seems like you do), then boost documents based on
>> the number of spelling errors -- higher boost factor for fewer errors.
> Another tip: score is based partly on term frequency, so when
> tokenizing correctly spelled words, emit multiple copies of them to
> weight towards them.
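
To make the two-field idea above concrete, here is a minimal sketch in plain Java. It is a toy model only and does not use Lucene or its scoring formula: the class `TwoFieldSketch`, the `Doc` type, and both boost constants are hypothetical names and values chosen for illustration. The required clause runs against the corrected text, and the optional clause against the original text adds a smaller bonus.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the "index original and corrected text" idea.
// Assumption: this is NOT Lucene's scoring; all names are illustrative.
class TwoFieldSketch {
    // A document keeps its tokens as written and after spelling correction.
    static final class Doc {
        final List<String> original;
        final List<String> corrected;

        Doc(List<String> original, List<String> corrected) {
            this.original = original;
            this.corrected = corrected;
        }
    }

    // Hypothetical weights; real boosts would need experimentation.
    static final double CORRECTED_BOOST = 1.0;
    static final double ORIGINAL_BOOST = 0.5;

    // Returns 0 when the required clause (corrected field) does not match.
    static double score(Doc doc, String rawTerm, String correctedTerm) {
        if (!doc.corrected.contains(correctedTerm)) {
            return 0.0;
        }
        double s = CORRECTED_BOOST;
        if (doc.original.contains(rawTerm)) {
            s += ORIGINAL_BOOST; // optional clause only adds to the score
        }
        return s;
    }

    public static void main(String[] args) {
        Doc misspelled = new Doc(Arrays.asList("wellcome"), Arrays.asList("welcome"));
        Doc correct = new Doc(Arrays.asList("welcome"), Arrays.asList("welcome"));
        // The user typed "wellcome"; the spell checker maps it to "welcome".
        System.out.println(score(misspelled, "wellcome", "welcome"));
        System.out.println(score(correct, "wellcome", "welcome"));
    }
}
```

Both documents satisfy the required clause, but only the one whose original text contains the query term as typed picks up the extra boost, which is the ordering asked for in point 1.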
>>> 2- Documents that have search terms close to each other have a larger
>>> Score. For example, if I was searching for "welcome there",
>>> Document("welcome there") must be better than Document("welcome all
>>> there"). Note that "all" is a stop word in my implementation.
>> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores higher for
>> terms that are closer together.  You can construct the PhraseQuery
>> yourself (programmatically), or QueryParser takes it as:
>> "welcome there"~99999
>> (with the quotes).  99999 is the slop factor, which means to accept
>> documents where "welcome" is within 99999 positions of "there".
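
As a rough illustration of why a sloppy phrase favors closer terms, the sketch below weights a two-term match by 1 / (gap + 1), where the gap is the number of positions between the terms. This is a simplified stand-in written in plain Java, not Lucene's PhraseQuery: it handles only two terms that appear in order, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for sloppy phrase matching (two terms, in order only).
// Assumption: names are illustrative and this is not Lucene code.
class SlopSketch {
    // Weight of the closest in-order occurrence of (first, second),
    // or 0 if there is none within the allowed slop.
    static double sloppyWeight(List<String> tokens, String first, String second, int slop) {
        int best = -1;      // smallest gap seen so far, -1 means "no match yet"
        int lastFirst = -1; // position of the most recent occurrence of 'first'
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(first)) {
                lastFirst = i;
            } else if (lastFirst >= 0 && tokens.get(i).equals(second)) {
                int gap = i - lastFirst - 1; // tokens sitting between the two terms
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
        }
        if (best < 0 || best > slop) {
            return 0.0;
        }
        return 1.0 / (best + 1); // closer terms get a larger weight
    }

    public static void main(String[] args) {
        List<String> adjacent = Arrays.asList("welcome", "there");
        List<String> separated = Arrays.asList("welcome", "all", "there");
        System.out.println(sloppyWeight(adjacent, "welcome", "there", 99999));
        System.out.println(sloppyWeight(separated, "welcome", "there", 99999));
    }
}
```

Note that the second document only scores lower here because "all" still occupies a position. If a stop filter removes it without leaving a gap, the two token streams become identical.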
> The issue is that "all" is a stop word, though.  The StopFilter does  
> not leave a hole when stop words are removed, so indexing "welcome  
> all there" is exactly the same as indexing "welcome there" as far as  
> the index is concerned.  I started to address this situation in the  
> 1.4.x Lucene releases but it introduced a backward incompatible issue  
> so we reverted.  Care must be taken on the Query side of things -  
> PhraseQuery did not deal with anything but term position increments  
> of 1, but this has been addressed in the latest codebase (in  
> Subversion).
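
The effect of leaving, or not leaving, a hole can be sketched without Lucene. The plain-Java example below assigns positions to tokens while dropping stop words, either collapsing the gap or carrying it forward as a larger position increment. The class `StopPositionSketch` and its `filter` method are hypothetical names for illustration; this is not the StopFilter or PositionalStopFilter source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrates position bookkeeping when stop words are removed.
// Assumption: names are illustrative and this is not Lucene code.
class StopPositionSketch {
    // Returns surviving tokens as "term@position".
    // keepHoles=false collapses gaps; keepHoles=true preserves them.
    static List<String> filter(List<String> tokens, Set<String> stopWords, boolean keepHoles) {
        List<String> out = new ArrayList<>();
        int position = -1;
        int pendingIncrement = 1; // increment to apply to the next kept token
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                if (keepHoles) {
                    pendingIncrement++; // remember the skipped position
                }
                continue;
            }
            position += pendingIncrement;
            out.add(token + "@" + position);
            pendingIncrement = 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("all"));
        List<String> text = Arrays.asList("welcome", "all", "there");
        System.out.println(filter(text, stops, false)); // gap collapsed
        System.out.println(filter(text, stops, true));  // gap preserved
    }
}
```

With the gap collapsed, "welcome all there" yields the same positions as "welcome there", so no phrase query can tell them apart. With the gap preserved, "there" lands one position later, which is what a position-aware phrase match needs.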
> I built a PositionalStopFilter for, and discussed these details in, the
> Analysis chapter of "Lucene in Action" - it is available in the code
> .zip at
>     Erik