lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language
Date Tue, 01 Dec 2009 22:57:20 GMT


Robert Muir commented on LUCENE-2102:

Uwe, it *is* specific to the Turkish case.
For German, whether you have A followed by a combining umlaut, or A+umlaut as one character, lowercasing works regardless.
Turkish is the only case where it's more complex, because the casing of the character actually
depends upon a diacritic that may not be composed, and may have other diacritics in between.
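
To illustrate the difference, here is a minimal sketch. The class and method names are made up for this example, and the per-character loop is only an assumed stand-in for a context-free lowercasing step, not the actual LowerCaseFilter code. It lowercases the composed and decomposed spellings separately and checks whether the results are still canonically equivalent after NFC normalization.

```java
import java.text.Normalizer;

// Hypothetical demo class: not part of Lucene or of the attached patch.
class CompositionCasingDemo {

    // Lowercase one char at a time with no look-ahead or look-behind.
    // With turkish == true, a capital I is always mapped to dotless i (U+0131).
    static String lowerPerChar(String s, boolean turkish) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(turkish && c == 'I' ? '\u0131' : Character.toLowerCase(c));
        }
        return sb.toString();
    }

    // Canonical composition, so composed and decomposed spellings can be compared.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // German: A+umlaut as one character vs. A followed by COMBINING DIAERESIS.
        boolean german = nfc(lowerPerChar("\u00C4", false))
                .equals(nfc(lowerPerChar("A\u0308", false)));
        // Turkish: I WITH DOT ABOVE vs. I followed by COMBINING DOT ABOVE.
        boolean turkish = nfc(lowerPerChar("\u0130", true))
                .equals(nfc(lowerPerChar("I\u0307", true)));
        System.out.println(german);  // true
        System.out.println(turkish); // false
    }
}
```

In the Turkish case the decomposed input ends up as dotless i followed by a dot above, which is not equivalent to plain i, because the per-character step cannot see the mark that follows the I.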

This is what makes it such a bear to support in case folding:

#      Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard for more information.

The problem is that context is required, and sometimes marks must actually be deleted for
proper casing.

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
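
The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch. The class name, the method names, and especially `mayIntervene` are hypothetical. The real conditions are defined in terms of canonical combining classes, which the JDK does not expose, so the sketch hard-codes three below-attaching marks as the only ones allowed between the I and the dot above. It also handles BMP chars only.

```java
// Hypothetical sketch of context-sensitive Turkish/Azeri lowercasing.
class TurkishLowerSketch {
    private static final char DOT_ABOVE = '\u0307';
    private static final char DOTLESS_I = '\u0131';

    // ASSUMPTION: a tiny hand-picked set of marks whose combining class is
    // neither 0 nor 230 (dot below, cedilla, ogonek). A real implementation
    // would look up the combining class of every character instead.
    static boolean mayIntervene(char c) {
        return c == '\u0323' || c == '\u0327' || c == '\u0328';
    }

    // Before_Dot: the I at pos is followed by a dot above, with only
    // "intervening" marks allowed in between.
    static boolean isBeforeDot(String s, int pos) {
        for (int j = pos + 1; j < s.length(); j++) {
            char c = s.charAt(j);
            if (c == DOT_ABOVE) return true;
            if (!mayIntervene(c)) return false;
        }
        return false;
    }

    static String lowerTurkish(String in) {
        StringBuilder out = new StringBuilder(in.length());
        boolean afterI = false; // true while inside the mark run that follows a capital I
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == 'I') {
                // Not_Before_Dot: a capital I becomes dotless i unless a dot above follows.
                out.append(isBeforeDot(in, i) ? 'i' : DOTLESS_I);
                afterI = true;
            } else if (afterI && c == DOT_ABOVE) {
                // After_I: the dot above is deleted, it is now part of the plain i.
                afterI = false;
            } else if (afterI && mayIntervene(c)) {
                out.append(c); // keep the intervening mark and stay in the run
            } else {
                afterI = false;
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }
}
```

For example, `lowerTurkish("I")` gives dotless i, `lowerTurkish("I\u0307")` gives plain i with the mark removed, and `lowerTurkish("I\u0323\u0307")` gives i followed by the dot below. Both the look-ahead and the deletion depend on neighbouring characters, which is the context a per-character mapping does not have.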

bq. but the last time I was there, they just used the simplest composed chars (like Germans).

This is why I recommended we not go crazy and only work on the composed form. But in the future
we might want to correct this.
This is *impossible* to do with MappingCharFilter; that is my only point.

> LowerCaseFilter for Turkish language
> ------------------------------------
>                 Key: LUCENE-2102
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the lowercase
> of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.
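
The behaviour described in the issue can be reproduced with the JDK alone. This is a small sketch in which the class and helper names are made up for illustration. It compares the locale-independent `Character.toLowerCase` with `String.toLowerCase(Locale)` under the Turkish locale.

```java
import java.util.Locale;

// Hypothetical demo class, JDK calls only.
class DefaultVsTurkishLowercase {

    // Locale-independent simple case mapping for a single char.
    static char simpleLower(char c) {
        return Character.toLowerCase(c);
    }

    // Locale-sensitive lowercasing of a whole string.
    static String localeLower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(simpleLower('I') == 'i');                 // true
        System.out.println(localeLower("I", "tr").equals("\u0131")); // true
        System.out.println(localeLower("I", "en").equals("i"));      // true
    }
}
```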

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
