lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7916) CompositeBreakIterator is brittle under ICU4J upgrade.
Date Thu, 03 Aug 2017 00:36:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111977#comment-16111977
] 

Robert Muir commented on LUCENE-7916:
-------------------------------------

{quote}
In our case, we are using ICUTokenizer but we have modified the default ruleset of RuleBasedBreakIterator
to break on emoji characters so that we can search for emoji in text.
{quote}

Cool!

{quote}
The underlying issue for us is that Lucene 6.6.0 is pegged to a fairly old version of ICU.
In hindsight it might have been safer for us to fork lucene-analyzers-icu temporarily to build
our own internal release against ICU 59.1.
{quote}

Yeah, when we upgrade ICU versions we run a script that regenerates the normalization and segmentation
data files for that specific ICU jar / Unicode version: {{ant regenerate}} from lucene/analyzers/icu.
So at a minimum this should really be done (followed of course by {{ant test}}) so that
things work correctly.

{quote}
From what I've seen in JIRA and the git repo, it looks like 6.7 is targeted at ICU 59.1.
Is there an ETA for the release of 6.7?
{quote}

I'm not sure, maybe ask the dev list about this? But it seems most work is towards 7.0 and
onwards. 

The real problem was falling so far behind on ICU versions. You can see why if you look at
the ticket: LUCENE-7540. Mainly, a bug (http://bugs.icu-project.org/trac/ticket/12873) was
introduced into ICU that our test suite detected but we didn't know why. This was fixed in
ICU 59.1 so we were then able to upgrade.

> CompositeBreakIterator is brittle under ICU4J upgrade.
> ------------------------------------------------------
>
>                 Key: LUCENE-7916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7916
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>            Reporter: Chris Koenig
>         Attachments: LUCENE-7916.patch, LUCENE-7916.patch
>
>
> We use lucene-analyzers-icu version 6.6.0 in our project. Lucene 6.6.0 is built against
ICU4J version 56.1, but our use case requires us to use the latest version of ICU4J, 59.1.
> The problem that we have encountered is that CompositeBreakIterator.getBreakIterator(int
scriptCode) throws an ArrayIndexOutOfBoundsException for script codes higher than 166. In
ICU4J 56.1 the highest possible script code is 166, but in ICU4J 59.1 it is 174.
> Internally, CompositeBreakIterator is creating an array of size UScript.CODE_LIMIT, but
the value of CODE_LIMIT from ICU4J 56.1 is being baked into the bytecode by the compiler.
So even after overriding the version of the ICU4J dependency to 59.1 in our project, this
array will still be size 167, which is too small.
> {code}
> final class CompositeBreakIterator {
>   private final ICUTokenizerConfig config;
>   private final BreakIteratorWrapper wordBreakers[] = new BreakIteratorWrapper[UScript.CODE_LIMIT];
> {code}
> Output of javap run on CompositeBreakIterator.class from lucene-analyzers-icu-6.6.0.jar
> {code}
> Compiled from "CompositeBreakIterator.java"
> final class org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator {
>   org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig);
>     descriptor: (Lorg/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig;)V
>     Code:
>        0: aload_0
>        1: invokespecial #1                  // Method java/lang/Object."<init>":()V
>        4: aload_0
>        5: sipush        167
>        8: anewarray     #3                  // class org/apache/lucene/analysis/icu/segmentation/BreakIteratorWrapper
> {code}
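The inlining described above can be reproduced without ICU4J. The following is a minimal sketch, not code from Lucene or ICU4J: ScriptCodes is a hypothetical stand-in for UScript, and InliningDemo is a hypothetical caller. A static final int initialized with a constant expression is a compile-time constant, so javac copies its value into every class that reads it, while a reflective read goes to whichever version of the class is loaded at run time.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for com.ibm.icu.lang.UScript so the sketch runs without ICU4J.
final class ScriptCodes {
  // Initialized with a constant expression, so javac treats it as a compile-time constant.
  public static final int CODE_LIMIT = 167;

  private ScriptCodes() {}
}

// Hypothetical caller, playing the role of a class compiled against ScriptCodes.
final class InliningDemo {
  // javac copies the literal 167 into this class's bytecode; swapping in a newer
  // ScriptCodes class without recompiling InliningDemo would not change the result.
  static int compiledInLimit() {
    return ScriptCodes.CODE_LIMIT;
  }

  // Reflection reads the field from the ScriptCodes class that is loaded at run time,
  // so it would pick up a newer value after a dependency upgrade.
  static int runtimeLimit() {
    try {
      Field field = ScriptCodes.class.getField("CODE_LIMIT");
      return field.getInt(null);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("compiled-in: " + compiledInLimit());
    System.out.println("runtime:     " + runtimeLimit());
  }
}
```

In a single compilation unit both methods return the same value; the two only diverge when the class holding the constant is replaced without recompiling its callers, which is the situation reported here.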
> In our case, the ArrayIndexOutOfBoundsException was triggered when we encountered a stray
character of the Bhaiksuki script (script code 168) in a chunk of text that we processed.
> CompositeBreakIterator can be made more resilient by changing the type of wordBreakers
from an array to a Map and no longer relying on the value of UScript.CODE_LIMIT.
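
As an illustration of the array-to-Map idea, here is a minimal sketch under stated assumptions. It is not the attached LUCENE-7916.patch: PerScriptCache and its factory parameter are hypothetical names, and a String stands in for BreakIteratorWrapper so the example runs without Lucene or ICU4J. The point is that a Map keyed by script code has no fixed upper bound, so the lookup no longer depends on any value of UScript.CODE_LIMIT.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical helper: caches one lazily created value per script code.
final class PerScriptCache<T> {
  private final Map<Integer, T> byScript = new HashMap<>();
  private final IntFunction<T> factory;

  PerScriptCache(IntFunction<T> factory) {
    this.factory = factory;
  }

  // Creates the value on first use and reuses it afterwards.
  // There is no array bound, so any int script code is accepted.
  T get(int scriptCode) {
    return byScript.computeIfAbsent(scriptCode, code -> factory.apply(code));
  }

  // Number of script codes seen so far.
  int size() {
    return byScript.size();
  }
}

final class PerScriptCacheDemo {
  public static void main(String[] args) {
    // The String value is a placeholder for a per-script break iterator.
    PerScriptCache<String> cache = new PerScriptCache<>(code -> "breaker-" + code);
    System.out.println(cache.get(168)); // 168 is the Bhaiksuki code cited above
    System.out.println(cache.size());
  }
}
```

The sketch uses a plain HashMap, which is not thread-safe, and it boxes each script code on lookup; whether either matters depends on how the real class is used.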



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

