spark-dev mailing list archives

From Sean Owen <>
Subject Re: [DISCUSS] upper/lower of special characters
Date Wed, 19 Sep 2018 10:35:14 GMT
I don't have the details in front of me, but I recall we explicitly
overhauled locale-sensitive toUpper and toLower in the code for this exact
situation. The current behavior should be on purpose. I believe user data
strings are handled in a case sensitive way but things like reserved words
in SQL are not of course. The Spark behavior is most correct and consistent
with Hive, right?

On Wed, Sep 19, 2018, 1:14 AM seancxmao <> wrote:

> Hi, all
> We found some differences between Spark and other database systems in how
> they handle the case of special characters. See the list below for an
> example (you may also check the attached pictures):
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> Spark      I, i with dot, I, i
> Hive       I, i with dot, I, i
> Teradata   I, i,          I, i
> Oracle     I, i,          I, i
> SQLServer  I, i,          I, i
> MySQL      I, i,          I, i
> "İ" and "ı" are Turkish characters. If locale-sensitive case handling were
> used, the expected results of the above upper/lower functions would be:
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> İ, i, I, ı
> But it seems that these systems all do locale-insensitive mapping. Presto
> explicitly describes this as a known issue in its docs (
> > The lower() and upper() functions do not perform locale-sensitive,
> context-sensitive, or one-to-many mappings required for some languages.
> Specifically, this will return incorrect results for Lithuanian, Turkish
> and Azeri.
> Java-based systems behave the same as each other since they all depend on
> the same JDK String methods. Teradata/Oracle/SQLServer/MySQL also behave
> the same as each other. However, the two groups differ for lower("İ"):
> Java-based systems (Spark/Hive) return "i with dot" while the other
> database systems (Teradata/Oracle/SQLServer/MySQL) return "i".
> My questions:
> (1) Should we let Spark return "i" for lower("İ"), which is the same as the
> other database systems?
> (2) Should Spark support locale-sensitive upper/lower functions? Because
> different rows of a table may need different locales, we cannot even set
> the locale at table level. What we might do is provide upper(string,
> locale)/lower(string, locale), and let users decide which locale they want
> to use.
> Some references below. Just FYI.
> *
> *
> *
> *
> Your comments and advice are highly appreciated.
> Many thanks!
> Chenxiao Mao (Sean)

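The JDK behavior discussed in this thread can be reproduced with a small Java sketch. It is an illustration only: upper(String, Locale) and lower(String, Locale) below are hypothetical helpers that mirror the shape proposed in question (2); they are not existing Spark SQL functions, and the class name is made up. Results are printed as code points so the output does not depend on console encoding.

```java
import java.util.Locale;

public class CaseMappingDemo {

    // Hypothetical helpers mirroring the proposed upper(string, locale) /
    // lower(string, locale). They simply delegate to the JDK String methods.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Render a string as a space-separated list of Unicode code points.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        for (Locale locale : new Locale[] {Locale.ROOT, turkish}) {
            String name = locale.equals(Locale.ROOT) ? "ROOT" : locale.toLanguageTag();
            System.out.println("locale=" + name);
            // U+0130 is "İ" (capital I with dot above); U+0131 is "ı" (dotless i).
            System.out.println("  upper(i)      = " + codePoints(upper("i", locale)));
            System.out.println("  lower(U+0130) = " + codePoints(lower("\u0130", locale)));
            System.out.println("  upper(U+0131) = " + codePoints(upper("\u0131", locale)));
            System.out.println("  lower(I)      = " + codePoints(lower("I", locale)));
        }
    }
}
```

With Locale.ROOT, lower("İ") comes back as two code points (U+0069 followed by the combining dot U+0307), which corresponds to the "i with dot" entry in the Spark/Hive rows above. With the Turkish locale, the four results are İ, i, I, ı, matching the locale-sensitive row in the original message.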