lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0
Date Fri, 26 Feb 2016 19:42:18 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169638#comment-15169638 ]

Steve Rowe edited comment on LUCENE-6993 at 2/26/16 7:41 PM:
-------------------------------------------------------------

{{ClassicTokenizer}} does have direct Unicode version dependencies: {{\[:digit:]}} and {{\[:alpha:]}}
are the equivalent of {{\p\{Digit}}} and {{\p\{Letter}}}, respectively.  Right now those definitions
are pinned at Unicode 3.0, which means that characters added since Unicode 3.0 (released 15
years ago, in 2000) will not be properly tokenized.

Also, there are several effectively-pinned character sets (for CJK and Thai) that are hard-coded
in the grammar, and don't include any supplementary characters at all.  If the Unicode version
changes, these will need to be moved to use the appropriate Unicode properties instead.
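
To make the effect concrete, here is a minimal sketch that uses only the JDK's own Unicode tables, not the JFlex grammar or any Lucene API. The class name {{PostUnicode30Demo}} and the sample code points are assumptions chosen for illustration. It relies on the fact that no supplementary letters or digits were encoded until Unicode 3.1, so anything it reports as a supplementary letter or digit is invisible to definitions pinned at Unicode 3.0:

```java
import java.util.regex.Pattern;

// Illustration only: checks code points against the running JDK's Unicode data.
// A grammar pinned at Unicode 3.0 would not classify the supplementary samples
// below as letters or digits, because they were encoded in later versions.
final class PostUnicode30Demo {
    // General categories L (Letter) and Nd (decimal digit) in java.util.regex.
    private static final Pattern LETTER = Pattern.compile("\\p{L}");
    private static final Pattern DIGIT = Pattern.compile("\\p{Nd}");

    static boolean isLetter(int codePoint) {
        return LETTER.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean isDigit(int codePoint) {
        return DIGIT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+20000: first CJK Extension B ideograph; U+1D7CE: MATHEMATICAL BOLD DIGIT ZERO;
        // 'a': a BMP letter for comparison.
        int[] samples = {0x20000, 0x1D7CE, 'a'};
        for (int cp : samples) {
            System.out.printf("U+%04X supplementary=%b letter=%b digit=%b%n",
                cp, Character.isSupplementaryCodePoint(cp), isLetter(cp), isDigit(cp));
        }
    }
}
```

How a particular tokenizer actually splits text containing such characters depends on its grammar; the sketch only shows which characters a newer Unicode version classifies as letters or digits.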

I guess I'm -0 on leaving the Unicode version as-is because of the above, but since this tokenizer
will never be removed, it seems bad to me to keep it pinned to such an old Unicode version.


was (Author: steve_rowe):
{{ClassicTokenizer}} does have direct Unicode version dependencies: {{\[:digit:]}} and {{\[:alpha:]}}
are the equivalent of {{\p\{Digit}}} and {{\p\{Letter}}}, respectively.  Right now those definitions
are pinned at Unicode 3.0, which means that characters added since Unicode 3.0 (released 15
years ago, in 2000) will not be properly tokenized.

Also, there are several effectively-pinned character sets (for CJK) that are hard-coded in
the grammar, and don't include any supplementary characters at all.  If the Unicode version
changes, these will need to be moved to use the appropriate Unicode properties instead.

I guess I'm -0 on leaving the Unicode version as-is because of the above, but since this tokenizer
will never be removed, it seems bad to me to keep it pinned to such an old Unicode version.

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers
> to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch,
>                      LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the list of TLDs
> again. Comparing our old list with a new list indicates 800+ new domains, so it would be nice
> to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.




