lucene-java-user mailing list archives

From "OBender" <>
Subject Hindi, diacritics and search results
Date Fri, 10 Jul 2009 19:10:10 GMT
Hi All,


I'm using the default setup of Lucene (no custom analyzers configured) and
came across the following issue:

In Hindi, if a phrase contains a letter with a diacritic, Lucene will find
the phrase even when the search string contains only the letter without the
diacritic.

Is this expected behavior? Is it standard for all languages whose letters
can carry diacritics?


From a pure byte standpoint I can see the logic: the letter with the diacritic
takes 6 bytes (E0 A4 95 E0 A5 87) and the bare letter takes 3 (E0 A4 95),
so if I search for *some_letter*, where some_letter is encoded as (E0 A4 95),
Lucene finds the "phrase" (E0 A4 95 E0 A5 87) that includes that letter.
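
To make the byte layout concrete, here is a minimal sketch in plain Java (JDK
only, no Lucene involved). It assumes the byte sequences quoted above are
UTF-8, in which case they decode to U+0915 followed by U+0947. The class name
DiacriticBytes and the helper utf8Hex are made up for illustration. It only
confirms the byte-level observation and says nothing about how Lucene's
analyzers actually tokenize the text.

```java
import java.nio.charset.StandardCharsets;

class DiacriticBytes {
    // Hypothetical helper: hex dump of the UTF-8 encoding of a string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the character is a non-spacing combining mark (Unicode category Mn).
    static boolean isNonSpacingMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        String letter = "\u0915";         // the bare letter (3 bytes in UTF-8)
        String withSign = "\u0915\u0947"; // same letter followed by the sign (6 bytes)

        System.out.println(utf8Hex(letter));   // E0 A4 95
        System.out.println(utf8Hex(withSign)); // E0 A4 95 E0 A5 87

        // The longer sequence is the bare letter plus a separate code point,
        // so the bare letter is a literal prefix of it.
        System.out.println(withSign.startsWith(letter));
        System.out.println(isNonSpacingMark(withSign.charAt(1)));
    }
}
```

In other words, the "letter with diacritic" is stored as two code points, the
base letter and a combining sign, rather than as one precomposed character.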


Any comments much appreciated.



