lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language
Date Tue, 01 Dec 2009 22:31:22 GMT


Robert Muir commented on LUCENE-2102:

bq. The patch's TurkishLowerCaseFilter is as unflexible as that. The idea is just a replacement
for the current patch (and it is even a little bit more universal, because you can change
the chars to map).

Uwe, this is not true. With a TokenFilter, I can use Version to apply the logic I mentioned:
bq. after finding a regular I (\u0049) we could search ahead for COMBINING DOT ABOVE (ignoring
any nonspacing marks and format and such along the way), and handle this differently.

You cannot do this with MappingCharFilter, or rather, you could, but there would be millions
of mappings for this one character. I could later patch this filter with Version and some
lookahead based on Unicode properties if I wanted to improve it.
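
To illustrate the lookahead idea, here is a minimal plain-Java sketch. It is not the code in the attached patch, it does not use the Lucene TokenFilter API, and the class and method names are hypothetical. It assumes "nonspacing marks and format" means the Unicode general categories Mn and Cf: after a regular I (U+0049) it scans ahead for COMBINING DOT ABOVE (U+0307), skipping those characters, and picks the lowercase form accordingly.

```java
// Hypothetical sketch; not the LUCENE-2102 patch and not a Lucene API.
final class TurkishLowerCaseSketch {
    static final int COMBINING_DOT_ABOVE = 0x0307;
    static final int CAPITAL_I_WITH_DOT_ABOVE = 0x0130;
    static final int SMALL_DOTLESS_I = 0x0131;

    // True if COMBINING DOT ABOVE is reached from pos while skipping only
    // nonspacing marks (Mn) and format characters (Cf).
    static boolean dotAboveFollows(String s, int pos) {
        for (int i = pos; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == COMBINING_DOT_ABOVE) {
                return true;
            }
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK && type != Character.FORMAT) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    static String lowerCase(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean dropNextDot = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 'I') {
                if (dotAboveFollows(s, i)) {
                    // I + COMBINING DOT ABOVE is a decomposed dotted capital I.
                    out.append('i');
                    dropNextDot = true;
                } else {
                    out.appendCodePoint(SMALL_DOTLESS_I);
                }
            } else if (cp == COMBINING_DOT_ABOVE && dropNextDot) {
                // The dot is already part of the plain 'i' emitted above.
                dropNextDot = false;
            } else if (cp == CAPITAL_I_WITH_DOT_ABOVE) {
                out.append('i');
            } else {
                out.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerCase("ISPARTA"));
        System.out.println(lowerCase("I\u0307STANBUL"));
    }
}
```

A mapping-based approach would need one entry for every sequence of marks that can sit between the I and the dot, whereas the scan handles them with a single property check.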

> LowerCaseFilter for Turkish language
> ------------------------------------
>                 Key: LUCENE-2102
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the
> lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.
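
The behavior described in the issue can be reproduced with the JDK alone. The sketch below is only an illustration (the class name is hypothetical): it contrasts the locale-independent `Character.toLowerCase` with `String.toLowerCase` given a Turkish locale.

```java
import java.util.Locale;

// Hypothetical demo class; uses only standard JDK calls.
final class TurkishCaseDemo {
    // Locale-independent, per-character lowercasing: 'I' becomes 'i'.
    static char defaultLower(char c) {
        return Character.toLowerCase(c);
    }

    // Locale-sensitive lowercasing with Turkish rules: "I" becomes
    // LATIN SMALL LETTER DOTLESS I (U+0131).
    static String turkishLower(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(defaultLower('I'));
        System.out.println(turkishLower("I"));
    }
}
```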

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
