spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marco Gaido <marcogaid...@gmail.com>
Subject Re: Decimals
Date Fri, 22 Dec 2017 11:11:59 GMT
Hi Xiao, hi all,

I checked the DB2 documentation for which you provided me a link in the PR
(thanks for it), and what you are stating is not really right.
DB2, in compliance with the SQL standards, throws an exception if an
overflow occurs, ie. if a loss of significant digits is necessary to
properly represent the value, which is the case I discussed as point 3 of
the previous e-mail. Since DB2 has a maximum precision of 31, while the
other DBs have an higher precision (in particular SQLServer and Hive as
Spark have 38 as maximum precision), the same operation running fine on
Hive or SQLServer (or Spark after my PR) may throw an exception on DB2 (but
this is due to overflow, ie. point 3, not to loss of precision).

In the case of loss of precision, in compliance with SQL standards, DB2
performs truncation. (emitting just a warning) And it can truncate more
than us, since we are ensuring to save at least 6 digits for the scale,
while DB2 has a minimum of 0 for the scale (refer to
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_decimalmultiplication.html).
I'm citing the relevant part here for convenience:

The truncated copy has a scale of MAX(0,S-(P-15)), where P and S are the
> original precision and scale. If, in the process of truncation, one or more
> nonzero digits are removed, SQLWARN7 in SQLCA is set to W, indicating loss
> of precision.
>

Moreover, the rules applied by DB2 to determine precision are scale are
analogous with the one used by Hive and SQLServer. The only real practical
difference is that we are enforcing to keep at least 6 as value for scale,
while DB2 has 0 as minimum (which is even worse than our approach according
to your previous e-mail about the financial use case).

For the brave ones who went on reading so far, I'll summarize the situation
of the 3 point above, adding DB2 to the comparison:

 1. *Rules to determine precision and scale*
     - *Hive, SQLServer (and Spark after the PR)*: I won't include the
exact formulas, anyway the relevant part is that in case of precision
higher that the maximum value, we use the maximum available value (38) as
precision and the maximum between the needed scale (computing according the
relevant formula) and a minimum value guaranteed for the scale which is 6.
     - *DB2*: practically same rules as above. Main difference are: the
maximum precision is 31 and it doesn't enforce any minimum value for the
scale (or the minimum value guaranteed for the scale is 0).
     - *Postgres and Oracle*: NA, they have nearly infinite precision...
     - *SQL ANSI 2011*: no indication
     - *Spark now*: if the precision needed is more than 38, use 38 as
precision; use the needed scale without any adjustment.

  2. *Behavior in case of precision loss but result in the range of the
representable values*
     - *Hive, SQLServer (and Spark after the PR)*: round the result.
     - *DB2*: truncates the result (and emits a warning).
     - *Postgres and Oracle*: NA, they have nearly infinite precision...
     - *SQL ANSI 2011*: either truncate or round the value.
     - *Spark now*: returns NULL.

  3. *Behavior in case of result out of the range of the representable
values (i.e overflow)*
     - *DB2, **SQLServer*: throw an exception.
     - *Postgres and Oracle*: NA, they have nearly infinite precision...
     - *SQL ANSI 2011*: an exception should be raised
     - *Spark now, Hive*: return NULL (for Hive, there is a open ticket to
make it compliant to the SQL standard).

Therefore, I am always more convinced that the behavior proposed in the PR
for points 1 and 2 is the right one. For 3, I am not sure because Hive
behaves differently and now we are compliant to Hive. I would propose to
adhere to the SQL standard, but I am open to discuss it (indeed I'd really
love some feedbacks by the community on it).

Thanks,
Marco


2017-12-22 3:58 GMT+01:00 Marco Gaido <marcogaido91@gmail.com>:

> Thanks for your answer Xiao. The point is that behaving like this is
> against SQL standard and is different also from Hive's behavior. Then I
> would propose to add a configuration flag to switch between the two
> behaviors, either being SQL compliant and Hive compliant or behaving like
> now (as Hermann was suggesting in the PR). Do we agree on this way? If so,
> is there any way to read a configuration property in the catalyst project?
>
> Thank you,
> Marco
> ------------------------------
> Da: Xiao Li <gatorsmile@gmail.com>
> Inviato: ‎21/‎12/‎2017 22:46
> A: Marco Gaido <marcogaido91@gmail.com>
> Cc: Reynold Xin <rxin@databricks.com>; dev@spark.apache.org
> Oggetto: Re: Decimals
>
> Losing precision is not acceptable to financial customers. Thus, instead
> of returning NULL, I saw DB2 issues the following error message:
>
> SQL0802N  Arithmetic overflow or other arithmetic exception occurred.
> SQLSTATE=22003
>
> DB2 on z/OS is being used by most of biggest banks and financial
> intuitions since 1980s. Either issuing exceptions (what DB2 does) or
> returning NULL (what we are doing) looks fine to me. If they want to avoid
> getting NULL or exceptions, users should manually putting the round
> functions by themselves.
>
> Also see the technote of DB2 zOS: http://www-01.ibm.com/
> support/docview.wss?uid=swg21161024
>
>
>
>
>
>
> 2017-12-19 8:41 GMT-08:00 Marco Gaido <marcogaido91@gmail.com>:
>
>> Hello everybody,
>>
>> I did some further researches and now I am sharing my findings. I am
>> sorry, it is going to be a quite long e-mail, but I'd really appreciate
>> some feedbacks when you have time to read it.
>>
>> Spark's current implementation of arithmetic operations on decimals was
>> "copied" from Hive. Thus, the initial goal of the implementation was to be
>> compliant with Hive, which itself aims to reproduce SQLServer behavior.
>> Therefore I compared these 3 DBs and of course I checked the SQL ANSI
>> standard 2011 (you can find it at http://standards.iso.org/ittf/
>> PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip) and a late
>> draft of the standard 2003 (http://www.wiscorp.com/sql_2003_standard.zip).
>> The main topics are 3:
>>
>>    1. how to determine the precision and scale of a result;
>>    2. how to behave when the result is a number which is not
>>    representable exactly with the result's precision and scale (ie. requires
>>    precision loss);
>>    3. how to behave when the result is out of the range of the
>>    representable values with the result's precision and scale (ie. it is
>>    bigger of the biggest number representable or lower the lowest one).
>>
>> Currently, Spark behaves like follows:
>>
>>    1. It follows some rules taken from intial Hive implementation;
>>    2. it returns NULL;
>>    3. it returns NULL.
>>
>>
>> The SQL ANSI is pretty clear about points 2 and 3, while it says barely
>> nothing about point 1, I am citing SQL ANSI:2011 page 27:
>>
>> If the result cannot be represented exactly in the result type, then
>>> whether it is rounded
>>> or truncated is implementation-defined. An exception condition is raised
>>> if the result is
>>> outside the range of numeric values of the result type, or if the
>>> arithmetic operation
>>> is not defined for the operands.
>>
>>
>> Then, as you can see, Spark is not respecting the SQL standard neither
>> for point 2 and 3. Someone, then might argue that we need compatibility
>> with Hive. Then, let's take a look at it. Since Hive 2.2.0 (HIVE-15331),
>> Hive's behavior is:
>>
>>    1. Rules are a bit changed, to reflect SQLServer implementation as
>>    described in this blog (https://blogs.msdn.microsoft.
>>    com/sqlprogrammability/2006/03/29/multiplication-and-divisio
>>    n-with-numerics/
>>    <https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/>
>>    );
>>    2. It rounds the result;
>>    3. It returns NULL (HIVE-18291 is open to be compliant with SQL ANSI
>>    standard and throw an Exception).
>>
>> As far as the other DBs are regarded, there is little to say about Oracle
>> and Postgres, since they have a nearly infinite precision, thus it is hard
>> also to test the behavior in these conditions, but SQLServer has the same
>> precision as Hive and Spark. Thus, this is SQLServer behavior:
>>
>>    1. Rules should be the same as Hive, as described on their post
>>    (tests about the behavior confirm);
>>    2. It rounds the result;
>>    3. It throws an Exception.
>>
>> Therefore, since I think that Spark should be compliant to SQL ANSI
>> (first) and Hive, I propose the following changes:
>>
>>    1. Update the rules to derive the result type in order to reflect new
>>    Hive's one (which are SQLServer's one);
>>    2. Change Spark behavior to round the result, as done by Hive and
>>    SQLServer and prescribed by the SQL standard;
>>    3. Change Spark's behavior, introducing a configuration parameter in
>>    order to determine whether to return null or throw an Exception (by default
>>    I propose to throw an exception in order to be compliant with the SQL
>>    standard, which IMHO is more important that being compliant with Hive).
>>
>> For 1 and 2, I prepared a PR, which is https://github.com/apache/spar
>> k/pull/20023. For 3, I'd love to get your feedbacks in order to agree on
>> what to do and then I will eventually do a PR which reflect what decided
>> here by the community.
>> I would really love to get your feedback either here or on the PR.
>>
>> Thanks for your patience and your time reading this long email,
>> Best regards.
>> Marco
>>
>>
>> 2017-12-13 9:08 GMT+01:00 Reynold Xin <rxin@databricks.com>:
>>
>>> Responses inline
>>>
>>> On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido <marcogaido91@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I saw in these weeks that there are a lot of problems related to
>>>> decimal values (SPARK-22036, SPARK-22755, for instance). Some are related
>>>> to historical choices, which I don't know, thus please excuse me if I am
>>>> saying dumb things:
>>>>
>>>>  - why are we interpreting literal constants in queries as Decimal and
>>>> not as Double? I think it is very unlikely that a user can enter a number
>>>> which is beyond Double precision.
>>>>
>>>
>>> Probably just to be consistent with some popular databases.
>>>
>>>
>>>
>>>>  - why are we returning null in case of precision loss? Is this
>>>> approach better than just giving a result which might loose some accuracy?
>>>>
>>>
>>> The contract with decimal is that it should never lose precision (it is
>>> created for financial reports, accounting, etc). Returning null is at least
>>> telling the user the data type can no longer support the precision required.
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>
>>>
>>
>

Mime
View raw message