lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language
Date Tue, 01 Dec 2009 22:57:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784499#action_12784499 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

Uwe, it *is* specific to the Turkish case.
For German, whether you have A followed by a combining umlaut, or A+umlaut as one precomposed character, it works regardless.
Turkish is the only case where it's more complex, because the casing of the character actually
depends upon a diacritic that may not be composed, and may have other diacritics in between.
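
To illustrate the German half of that claim, here is a minimal sketch using only the JDK (java.text.Normalizer and String.toLowerCase). The class and method names are made up for illustration and are not part of the patch. It lowercases both spellings of A+umlaut and compares them after NFC normalization:

```java
import java.text.Normalizer;
import java.util.Locale;

final class UmlautCasingSketch {
    /** Lowercase without locale-specific rules, then compose to NFC. */
    static String lowerNfc(String s) {
        return Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00C4";    // A+umlaut as one precomposed character
        String decomposed = "A\u0308"; // A followed by COMBINING DIAERESIS

        // Both spellings end up canonically equivalent after lowercasing.
        System.out.println(lowerNfc(composed).equals(lowerNfc(decomposed)));
    }
}
```

Each character can be lowercased on its own here, with no look-ahead or look-behind, which is exactly what the Turkish I does not allow.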

This is what makes it such a bear to support in case folding:

{noformat}
#      Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard for more information.
{noformat}

The problem is that context is required, and sometimes marks must actually be deleted for
proper casing.

{noformat}
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
{noformat}
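
As a rough sketch of what those two rules demand (this is not the code in the attached patch), the following hand-rolled lowercasing routine checks the After_I and Not_Before_Dot contexts by scanning the neighbouring characters. Everything here is an assumption made for illustration: the class name, the method names, and especially mayIntervene(), which hard-codes three below-the-letter marks as a stand-in for a real lookup of "combining class neither 0 nor 230". It also only handles BMP characters.

```java
final class TurkicLowerSketch {
    static final char DOT_ABOVE = '\u0307';   // COMBINING DOT ABOVE
    static final char DOTLESS_I = '\u0131';   // LATIN SMALL LETTER DOTLESS I
    static final char CAPITAL_I_DOT = '\u0130'; // LATIN CAPITAL LETTER I WITH DOT ABOVE

    /**
     * Hypothetical stand-in for "combining class is neither 0 nor 230".
     * Only DOT BELOW, CEDILLA and OGONEK are recognised in this sketch.
     */
    static boolean mayIntervene(char c) {
        return c == '\u0323' || c == '\u0327' || c == '\u0328';
    }

    /** Before_Dot: a dot_above follows, with only skippable marks in between. */
    static boolean beforeDot(CharSequence s, int i) {
        for (int j = i + 1; j < s.length(); j++) {
            char c = s.charAt(j);
            if (c == DOT_ABOVE) return true;
            if (!mayIntervene(c)) return false;
        }
        return false;
    }

    /** After_I: an uppercase I precedes, with only skippable marks in between. */
    static boolean afterI(CharSequence s, int i) {
        for (int j = i - 1; j >= 0; j--) {
            char c = s.charAt(j);
            if (c == 'I') return true;
            if (!mayIntervene(c)) return false;
        }
        return false;
    }

    static String toLowerTurkic(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 'I') {
                // Not_Before_Dot: I becomes dotless i unless a dot_above follows.
                out.append(beforeDot(s, i) ? 'i' : DOTLESS_I);
            } else if (c == DOT_ABOVE && afterI(s, i)) {
                // After_I: the mark is deleted, so I + dot_above becomes plain i.
            } else if (c == CAPITAL_I_DOT) {
                out.append('i');
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLowerTurkic("I\u0307").equals("i"));
        System.out.println(toLowerTurkic("I").equals("\u0131"));
    }
}
```

The point of the sketch is that both rules need to look at characters other than the one being cased, and one of them produces an empty output for its input character.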

bq. but the last time I was there, they just used the simplest composed chars (like Germans).

This is why I recommended we not go crazy and only work on the composed form. But in the future
we might want to correct this.
This is *impossible* to do with MappingCharFilter; that is my only point.

> LowerCaseFilter for Turkish language
> ------------------------------------
>
>                 Key: LUCENE-2102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2102
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the
> lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.
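
For reference, the behaviour described in the issue can be reproduced with plain JDK calls. This is a minimal sketch, with a class name chosen only for illustration. It contrasts the locale-insensitive Character.toLowerCase() with String.toLowerCase(Locale) under a Turkish locale, for the composed form only:

```java
import java.util.Locale;

final class TurkishLowercaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Locale-sensitive lowercasing using the JDK's Turkish rules. */
    static String lowerTr(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // Character.toLowerCase() has no locale parameter: 'I' maps to U+0069.
        System.out.println(Integer.toHexString(Character.toLowerCase('I')));

        // With a Turkish locale, 'I' maps to U+0131 (dotless i).
        System.out.println(Integer.toHexString(lowerTr("I").charAt(0)));
    }
}
```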


