lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0
Date Thu, 18 Feb 2016 22:08:18 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153209#comment-15153209 ]

Steve Rowe edited comment on LUCENE-6993 at 2/18/16 10:07 PM:
--------------------------------------------------------------

[~mdrob], I haven't looked at your patch yet, but there is one non-rote Unicode upgrade item
that needs to be dealt with, from LUCENE-5357's TODO list:

* Upgrade the UAX#29-based grammars to the Unicode -6.3- _8.0_ word break rules, in StandardTokenizerImpl.jflex
and UAX29URLEmailTokenizer.jflex.

UAX#29 word break rules can (and usually do) change with each Unicode release, so we'll need
to review the changes between 6.3 and 8.0 and see what, if anything, needs changing in the
tokenizer grammars.  Another item from the LUCENE-5357 TODO list will confirm that this has
been done correctly:

* Test the new scanners against the Unicode -6.3- _8.0_ word break test data
** \[...]
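
For illustration only, here is a minimal sketch of how a data line from the Unicode word break test file (WordBreakTest.txt) could be turned into the segments a scanner is expected to produce. In that file, U+00F7 (÷) marks a break opportunity, U+00D7 (×) marks a position with no break, code points are written in hex, and a trailing # starts a comment. The class and method names are hypothetical and not taken from the Lucene source. Comparing against StandardTokenizer output would also require dropping segments the tokenizer does not emit (whitespace, punctuation), which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: parses one WordBreakTest.txt data line.
class WordBreakTestLine {

    private static final String BREAK = "\u00F7";    // division sign: break allowed here
    private static final String NO_BREAK = "\u00D7"; // multiplication sign: no break here

    /** Returns the expected segments for a line such as "÷ 0061 × 0062 ÷ 0020 ÷ # comment". */
    static List<String> expectedSegments(String line) {
        // Everything after '#' is a human-readable comment.
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();

        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String field : data.split("\\s+")) {
            if (field.isEmpty()) {
                continue; // blank or comment-only line
            }
            if (field.equals(BREAK)) {
                // A break closes the segment collected so far, if any.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
            } else if (!field.equals(NO_BREAK)) {
                // Any other field is a hex code point belonging to the current segment.
                current.appendCodePoint(Integer.parseInt(field, 16));
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        // "a" and "b" stay together; the space forms its own segment.
        System.out.println(expectedSegments("\u00F7 0061 \u00D7 0062 \u00F7 0020 \u00F7"));
    }
}
```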


was (Author: steve_rowe):
[~mdrob], I haven't looked at your patch yet but there is a non-rote Unicode upgrade item
that needs to be dealt with - from LUCENE-5357's TODO list:

* Upgrade the UAX#29-based grammars to the Unicode -6.3- _8.0_ word break rules, in StandardTokenizerImpl.jflex
and UAX29URLEmailTokenizer.jflex.

UAX#29 word break rules can (and usually do) change with each Unicode release, so we'll need
to review the changes between 6.3 and 8.0 and see what, if anything, needs changing in the
tokenizer grammars.  Another item from the LUCENE-5357 TODO list will confirm that this has
been done correctly:

* Test the new scanners against the Unicode 6.3 word break test data
** \[...]

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the list of TLDs
> again. Comparing our old list with a new list indicates 800+ new domains, so it would be nice
> to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.
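
As a rough sketch of the old-versus-new comparison described above, the following computes which TLDs appear only in the newer list. It assumes each list has one TLD per line with # comment lines, which is how the IANA TLD list is laid out. The class and method names are hypothetical, and this is not how the attached patch performs the comparison.

```java
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: diffs two TLD lists.
class TldListDiff {

    /** Lower-cases entries and drops blank lines and '#' comment lines. */
    static SortedSet<String> normalize(List<String> lines) {
        SortedSet<String> tlds = new TreeSet<>();
        for (String line : lines) {
            String tld = line.trim();
            if (tld.isEmpty() || tld.startsWith("#")) {
                continue;
            }
            tlds.add(tld.toLowerCase(Locale.ROOT));
        }
        return tlds;
    }

    /** Returns the TLDs present in newLines but absent from oldLines, sorted. */
    static SortedSet<String> added(List<String> oldLines, List<String> newLines) {
        SortedSet<String> result = normalize(newLines);
        result.removeAll(normalize(oldLines));
        return result;
    }

    public static void main(String[] args) {
        List<String> oldList = List.of("# old list", "COM", "ORG");
        List<String> newList = List.of("# new list", "COM", "ORG", "XYZ", "app");
        System.out.println(added(oldList, newList));
    }
}
```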



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

