lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2847) Support all of unicode in StandardTokenizer
Date Thu, 06 Jan 2011 05:21:48 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978141#action_12978141 ]

Steven Rowe commented on LUCENE-2847:
-------------------------------------

bq. We could also consolidate tools, because in general I would rather all the analyzers be
consolidated; they are only split up due to dependencies, large files, etc. But tools are different:
it's just to assist the build.

How far would you go with this tools consolidation? All tools across the whole of Scenolunr?
Or just the ones under {{modules/analysis/}}?

> Support all of unicode in StandardTokenizer
> -------------------------------------------
>
>                 Key: LUCENE-2847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2847
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Robert Muir
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2847.patch, LUCENE-2847.patch, LUCENE-2847.patch
>
>
> StandardTokenizer currently only supports the BMP.
> If it encounters characters outside of the BMP, it just discards them.
> It should instead fully implement UAX#29 across all of Unicode.
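
For readers unfamiliar with the BMP limitation described in the issue: in Java, a character outside the BMP (above U+FFFF) occupies two UTF-16 code units, so code that works one {{char}} at a time cannot see it as a single character. The following is only a minimal sketch of that distinction using standard {{java.lang}} methods. The class and method names are hypothetical, and it does not show how StandardTokenizer or the attached patch works.

```java
// Hypothetical illustration only; not taken from the Lucene code base or the patch.
class SupplementaryDemo {

    // Counts Unicode code points rather than UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns true if any code point in s lies outside the BMP (above U+FFFF).
    static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 for a BMP code point, by 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+10400 is a supplementary code point, stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x10400)) + "b";
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("Code points:       " + countCodePoints(s));
        System.out.println("Has supplementary: " + hasSupplementary(s));
    }
}
```

Stepping through the string with {{codePointAt}} and {{Character.charCount}} keeps each surrogate pair together, which is the minimum any code-point-aware processing needs.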

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

