lucene-dev mailing list archives

From DM Smith <>
Subject Re: Why release 3.0?
Date Tue, 17 Nov 2009 01:17:32 GMT

On Nov 16, 2009, at 7:53 PM, Robert Muir wrote:

> right, the only way you could really contain it would be to do something like that.

I'm looking forward to your ICU analyzer! IMHO, it would be great to have it be a pluggable replacement
for its counterparts in core. That is, using reflection, if the jar is present, then use
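
A minimal sketch of the reflection idea, assuming the optional implementation has a public no-argument constructor. The helper name `PluggableLoader` and the class names passed to it are hypothetical; nothing here is an existing Lucene or ICU API.

```java
import java.util.function.Supplier;

/** Hypothetical helper: prefer an optional implementation when its jar is on the classpath. */
final class PluggableLoader {
    private PluggableLoader() {}

    /**
     * Tries to instantiate className through its public no-arg constructor.
     * Falls back to the supplied default when the class is missing,
     * cannot be constructed, or is not of the expected type.
     */
    static <T> T loadOrDefault(String className, Class<T> type, Supplier<T> fallback) {
        try {
            Class<?> candidate = Class.forName(className);
            return type.cast(candidate.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Jar absent or incompatible: use the core implementation instead.
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // "com.example.icu.IcuTokenizer" is a made-up name used only for illustration.
        CharSequence chosen = loadOrDefault(
                "com.example.icu.IcuTokenizer", CharSequence.class, () -> "core tokenizer");
        System.out.println("Using: " + chosen);
    }
}
```

The catch is that the choice then depends on what happens to be on the classpath, so the same code can build different indexes on different machines.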

> I just think we should make users aware of this, that's all.

I've been reading the thread, and at first my response was "No big deal, it won't affect me"
(i.e. awareness of the problem). Now my thought is "I'm hosed" (i.e. understanding).

I think we need a mechanism (I mentioned this before) to build a manifest of the parts of
the tool chain that builds each field in an index. Then, if any part is revised in a way
that is not 100% backward compatible, we'd know.
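
A rough sketch of what such a manifest check could look like, assuming each field's manifest is just a map from component name to version string. The class `ToolchainManifest`, the component keys, and the version values are all hypothetical; where the manifest is stored is left open.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical per-field record of the tool chain versions used to build an index. */
final class ToolchainManifest {
    private ToolchainManifest() {}

    /**
     * Returns the fields whose current tool chain differs from the one recorded
     * at index time. Fields with no recorded manifest are treated as dirty.
     */
    static Set<String> dirtyFields(Map<String, Map<String, String>> stored,
                                   Map<String, Map<String, String>> current) {
        Set<String> dirty = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> e : current.entrySet()) {
            if (!e.getValue().equals(stored.get(e.getKey()))) {
                dirty.add(e.getKey());
            }
        }
        return dirty;
    }

    public static void main(String[] args) {
        // Component names and versions below are illustrative only.
        String java = System.getProperty("java.version");
        Map<String, Map<String, String>> stored = Map.of(
                "title", Map.of("java", java, "icu", "4.0"),
                "body", Map.of("java", java, "icu", "4.0"));
        Map<String, Map<String, String>> current = Map.of(
                "title", Map.of("java", java, "icu", "4.0"),
                "body", Map.of("java", java, "icu", "4.2"));
        System.out.println("Fields to rebuild: " + dirtyFields(stored, current));
    }
}
```

This only detects that a version changed, not whether the change is backward compatible, so it is still conservative.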

As it is, I'm just going to mark each index as dirty on each upgrade to Lucene, Java, or ICU,
and force a rebuild.

> and I think it sucks they might have to reindex twice with the current status of things
> (we did not complete unicode 4 support in lucene 3.0)
> which is why i mentioned this problem on the unicode 4 issues im trying to work.

Whether 3.0 goes out as it is now or with these fixes is up to the voters.

> 2.9->3.0 (to upgrade from Unicode 3 to Unicode 4-halfass)
> 3.0->3.1 (to upgrade from Unicode 4-halfass to Unicode 4-correct) [hopefully]

If this is the path, then perhaps the best advice is to skip 3.0 and take the pain once.

> btw, i created a diff from unicode 3's UCD to unicode 4's UCD, in case you want to see
> the changes:

That's an amazing number of changes, even when you ignore name changes.

> On Mon, Nov 16, 2009 at 7:42 PM, DM Smith <> wrote:
> On Nov 16, 2009, at 6:43 PM, Robert Muir wrote:
> > DM, in this case I'm not referring to surrogates, etc, but instead the idea that
> > properties for an existing character can change (the soft hyphen and arabic ayah were two
> > examples), also new characters are introduced.
> >
> > these will affect what analysis components (ex. tokenizers) do, because they like
> > to use categories such as .isWhitespace, .isLetter, things like that.
> >
> > this means these components have different behavior, because they are data-driven,
> > even though we didn't change any code.
> Then why not make ICU a dependency? At least then one has control of the delivered version.
> Any of us that are working with texts in non-Latin-1 languages are likely to be using ICU
> -- DM
> -- 
> Robert Muir
