spark-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: [DISCUSS] upper/lower of special characters
Date Wed, 19 Sep 2018 10:35:14 GMT
I don't have the details in front of me, but I recall we explicitly
overhauled locale-sensitive toUpper and toLower in the code for this exact
situation. The current behavior should be on purpose. I believe user data
strings are handled in a case sensitive way but things like reserved words
in SQL are not of course. The Spark behavior is most correct and consistent
with Hive, right?

On Wed, Sep 19, 2018, 1:14 AM seancxmao <seancxmao@gmail.com> wrote:

> Hi, all
>
> We found some differences in how Spark and other database systems handle
> the case of special characters. See the list below for an example (you may
> also check the attached pictures).
>
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> Spark      I, i with dot, I, i
> Hive       I, i with dot, I, i
> Teradata   I, i,          I, i
> Oracle     I, i,          I, i
> SQLServer  I, i,          I, i
> MySQL      I, i,          I, i
>
> "İ" and "ı" are Turkish characters. If locale-sensitive case handling were
> used, the expected results of the upper/lower functions above would be:
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> İ, i, I, ı
>
> But it seems that these systems all do locale-insensitive mapping. Presto
> explicitly describes this as a known issue in its docs (
> https://prestodb.io/docs/current/functions/string.html)
> > The lower() and upper() functions do not perform locale-sensitive,
> context-sensitive, or one-to-many mappings required for some languages.
> Specifically, this will return incorrect results for Lithuanian, Turkish
> and Azeri.
>
> Java-based systems behave the same way as one another, since they all rely
> on the same JDK String methods; Teradata/Oracle/SQLServer/MySQL are also
> consistent with one another. However, the Java-based systems return a
> different result for lower("İ"): Spark and Hive return "i with dot", while
> the other database systems (Teradata/Oracle/SQLServer/MySQL) return "i".
>
> My questions:
> (1) Should we let Spark return "i" for lower("İ"), which is same as other
> database systems?
> (2) Should Spark support locale-sensitive upper/lower functions? Because
> different rows of a table may need different locales, we cannot even set
> the locale at the table level. What we could do is provide upper(string,
> locale)/lower(string, locale) and let users decide which locale they want
> to use.
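A sketch of what such a two-argument function could do, assuming the locale is passed as a BCP 47 language tag; `lowerWithLocale` is a hypothetical name for illustration, not an existing Spark API:

```java
import java.util.Locale;

public class LocaleLower {
    // Hypothetical locale-aware lower(string, locale): the locale tag
    // travels with each call, so different rows can use different locales.
    public static String lowerWithLocale(String s, String localeTag) {
        if (s == null || localeTag == null) {
            return null; // SQL-style null propagation
        }
        return s.toLowerCase(Locale.forLanguageTag(localeTag));
    }

    public static void main(String[] args) {
        System.out.println(lowerWithLocale("I", "en")); // i
        System.out.println(lowerWithLocale("I", "tr")); // dotless ı
    }
}
```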
>
> Some references below. Just FYI.
>
> *
> https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase-java.util.Locale-
> *
> https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toUpperCase-java.util.Locale-
> * http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/
> *
> https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette
>
> Your comments and advice are highly appreciated.
>
> Many thanks!
> Chenxiao Mao (Sean)
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
