lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Lucene's default settings & back compatibility
Date Tue, 19 May 2009 11:26:08 GMT
I like the idea; some thoughts below.

On May 18, 2009, at 5:06 PM, Michael McCandless wrote:

> As we all know, Lucene's back-compat policy necessarily hurts the
> out-of-the-box experience for new users: because we are only allowed
> to make substantial improvements to Lucene's default settings at a major
> release, new users won't see the improvements to our settings until a
> major release (typically years apart).
> Lucene has a number of default settings, eg some recent examples:
>  * Read-only IndexReader gives much better performance with threads,
>    yet we must now default to return a non-readOnly
>    reader
>  * We can now optionally turn off scoring when sorting by field
>    (sizable speed gain), but we had to leave it on by default until
>    3.0
>  * Letting IndexReader.norms return null
>  * LogMergePolicy now takes deletions into account, but we had to
>    disable it by default, since it could conceivably break back
>    compat.
>  * Bug fixes in StandardAnalyzer must be delayed until 3.0 since
>    there's a remote chance they'd break back compat in an app, or we
>    end up adding confusing methods like "public static void
>    setDefaultReplaceInvalidAcronym".

I don't think we have said that bug fixes are required to be back  
compatible, even if it is in analysis.  I think it is a really bad  
idea for TokenStreams to have if clauses in them checking boolean  
values for old versus new behaviors.

When fixes can be back compat, we make them so, but it is not a requirement.   
For instance, we upgraded Snowball.

>  * NIOFSDirectory ought to be "the default" on UNIX, but it's not
>  * Constant score rewrite ought to be the default for most multi-term
>    queries
>  * StopFilter should enable position increments by default

Or, the removal of StopFilter as "Standard" altogether.  This  
coupled with a QP that created phrases around stop words is a better  

> The fact that we are "forced" to delay such "out of the box" improvements
> to Lucene for so long is a frustrating cost, since it can only stunt
> Lucene's adoption and growth and my sense is that it's a minority of
> Lucene's users that need such strict back-compat (this has been
> discussed before).  It also clutters our APIs because we end up
> creating setter/getters that often only exist for the sake of a back
> compat preservation of a bug.
> I think we can fix this.  Ie, maintain our strong back-compat policy,
> yet still allow new users to experience the best of Lucene on every
> release (not just on major releases), by creating an explicit class
> that holds settings/defaults used by Lucene.
> For example, say we create a base class named Settings.  It holds the
> defaults for settings across all of Lucene's classes. When you create
> IndexReader, IndexWriter and others, you must pass in a Settings
> instance.
> A subclass, SettingsMatching24, binds all settings to "match" 2.4's
> behavior.  When we make improvements in 2.9, we'd add the back-compat
> settings to SettingsMatching24.  So if your app wants to keep exactly
> 2.4's behavior, you'd pass in SettingsMatching24().  On upgrading to
> 2.9 you'd still see 2.4's behavior.
> Users who'd like to see Lucene's improvements on each minor release
> would instead instantiate LatestAndGreatestSettings() (or
> CurrentVersionSettings(), or something), understanding that when they
> upgrade there might be biggish changes to Lucene's defaults.  My guess
> is most users would use this settings class.
> Doug actually suggested this exact idea a while back:
> Now that I realize we could use this to strongly decouple "users
> wanting precise back-compat" from "users wanting the latest &
> greatest", I think it's a very compelling solution.
> If we do this I'd like to do it in 2.9, so that starting with 3.x we
> are free to change default settings w/o breaking back compat.
> Thoughts?
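
For concreteness, here is a rough sketch of what the proposed Settings hierarchy might look like. This is only an illustration under assumptions: the class names Settings, SettingsMatching24 and LatestAndGreatestSettings come from the proposal above, but the method names and the SettingsSketch demo class are hypothetical, and none of this is an existing Lucene API. The three defaults shown are taken from the examples listed in the quoted message.

```java
// Hypothetical sketch of the proposal; these classes are not part of Lucene.
abstract class Settings {
    // Should opening an IndexReader return a read-only reader by default?
    abstract boolean readOnlyReaderByDefault();

    // Should scores be computed when sorting by field?
    abstract boolean scoreWhenSortingByField();

    // Should StopFilter enable position increments by default?
    abstract boolean stopFilterEnablesPositionIncrements();
}

// Pins every default to the 2.4 behavior described in the thread.
class SettingsMatching24 extends Settings {
    boolean readOnlyReaderByDefault() { return false; }
    boolean scoreWhenSortingByField() { return true; }
    boolean stopFilterEnablesPositionIncrements() { return false; }
}

// Tracks whatever the current release considers the best defaults.
class LatestAndGreatestSettings extends Settings {
    boolean readOnlyReaderByDefault() { return true; }
    boolean scoreWhenSortingByField() { return false; }
    boolean stopFilterEnablesPositionIncrements() { return true; }
}

class SettingsSketch {
    // Summarizes a Settings instance as a single line of text.
    static String describe(Settings s) {
        return "readOnly=" + s.readOnlyReaderByDefault()
            + " scoreOnSort=" + s.scoreWhenSortingByField()
            + " stopPosIncr=" + s.stopFilterEnablesPositionIncrements();
    }

    public static void main(String[] args) {
        System.out.println("2.4:    " + describe(new SettingsMatching24()));
        System.out.println("latest: " + describe(new LatestAndGreatestSettings()));
    }
}
```

An application that wants to stay on 2.4 behavior across upgrades would pass a SettingsMatching24 instance everywhere; one that wants each release's improvements would pass LatestAndGreatestSettings instead.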

For instance, if we removed the StopFilter from the StandardAnalyzer,  
then what?  A Settings object would not be able to account for it.    
Likewise, the subtler issue of "fixing" a TokenStream such that it  
might produce different tokens.

I really worry about Settings objects having to be repeatedly checked  
inside of tight inner loops.  Even looking at the new TokenStream  
stuff, there are now checks for the "new API" in an area that is  
called _a lot_ of times.
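
One way the inner-loop cost might be kept down, sketched here under assumptions, is to read the setting once at construction time and keep it in a final field, so the per-token loop never calls back into the Settings object. FilterSettings and StopFilterSketch below are hypothetical names for illustration only, and the filter is simplified to operate on a list of strings rather than a real TokenStream. Note that a branch on the final field still remains inside the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the proposed Settings class; not a Lucene API.
interface FilterSettings {
    boolean enablePositionIncrements();
}

// Hypothetical, simplified stop-word filter working on plain strings.
class StopFilterSketch {
    private final Set<String> stopWords;
    // Resolved once here, so the loop below only reads a final field.
    private final boolean enablePositionIncrements;

    StopFilterSketch(Set<String> stopWords, FilterSettings settings) {
        this.stopWords = stopWords;
        this.enablePositionIncrements = settings.enablePositionIncrements();
    }

    // Returns the position increment assigned to each surviving token.
    List<Integer> positionIncrements(List<String> tokens) {
        List<Integer> increments = new ArrayList<>();
        int skipped = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;          // remember the gap left by a stop word
                continue;
            }
            increments.add(enablePositionIncrements ? 1 + skipped : 1);
            skipped = 0;
        }
        return increments;
    }
}
```

This does not address the separate concern about structural changes, such as removing StopFilter from StandardAnalyzer, which a boolean setting cannot express.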

Last, and I mention it mostly as an afterthought: how are you going  
to handle changes to the Settings?  Say, for instance, we come out w/  
Settings2.4, release it and then we realize we missed something (and  
this seems likely given the number of settings available in Lucene),  
then what?  We deprecate Settings2.4 and come out with  
TheRealSettingsFor2.4?  And then when that is incomplete?

I still think we would benefit from just communicating upcoming  
changes better even in minor releases, thereby allowing for a bit more  
variance in back compat.  It should be the exception, not the rule.

Still, I think this is worth pursuing.

