spark-dev mailing list archives

From Greg Rahn <greg.r...@gmail.com>
Subject Re: Allow average out of a Date
Date Fri, 21 Aug 2020 16:13:17 GMT
Fokko-

I can certainly understand the ask for this behavior, and it seems simple
and straightforward at face value, since R's Date class behaves this way.
However, I think there are several challenges involved here: AFAIK, there is
no well-defined way for SQL systems to do this implicitly today, and I think
there is a clear explanation of why.

As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if AVG()
works on DATE, then SUM() (and the plus operator) would need to work as
well. But what is the definition of SUM() for a list of DATEs, or of DATE +
DATE? There is no such definition in the SQL world. And take a very simple
example: what's the AVG() of today and yesterday? If you represent these
dates as their Julian Day Numbers, the average of two consecutive integers
is a non-integer value halfway between them -- so now what? Should it be
ceil() or floor()? Round down or round up? Should the result be yesterday or
today? It seems very challenging to deal with this in a formal and
standardized way.
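
To make this concrete, here's a minimal sketch in generic SQL (the day
numbers are hypothetical stand-ins for Julian Day Numbers, and the inline
VALUES syntax varies by dialect):

SELECT AVG(d) AS avg_day
FROM (VALUES (2459082), (2459083)) AS t(d);  -- yesterday, today
-- => 2459082.5
-- floor() gives back yesterday, ceil() gives today -- both equally defensible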

It's my understanding that it is this ambiguity, and the need for an
intermediate numerical representation like the Julian Day Number, that
prevents this from being an implicit SQL data type conversion, and why users
are forced to choose their own explicit conversion (to a Julian date, for
example) in order to do such mathematical operations. As we see in the
SQL:2016 standard cast specification and conversion table (below), there is
no standardized way to implicitly cast between DATE and numeric types,
because there is not just one numerical representation of a DATE.
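
To illustrate, the explicit version in Spark SQL might look like the
following -- a sketch only; datediff() and date_add() are existing Spark SQL
functions, but the floor() is an arbitrary choice the user has to make:

SELECT date_add(DATE'1970-01-01',
                CAST(FLOOR(AVG(datediff(d, DATE'1970-01-01'))) AS INT)) AS avg_date
FROM VALUES (DATE'2020-08-20'), (DATE'2020-08-21') AS t(d);
-- days since epoch are 18494 and 18495, so AVG() = 18494.5
-- FLOOR => 2020-08-20, whereas CEIL would give 2020-08-21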

If we look back at the use case from SPARK-10520, which is about R's summary
function on a vector of dates (and I don't claim to be an R internals
expert), maybe it makes more sense to implement R's Date class behavior in
SparkR rather than more generically across multiple languages and types. If
possible, this would yield the desired behavior for the use case in
question without broadening the requirement so far that it becomes too
challenging to implement and conflicts with other well-defined standards,
like ANSI SQL:2016.


[image: SQL:2016 cast specification and conversion table]


For reference:
https://github.com/apache/spark/pull/28754
https://issues.apache.org/jira/browse/SPARK-10520




On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fokko@driesprong.frl>
wrote:

> Hi all,
>
> Personally, I'm a big fan of the .summary() function for computing statistics
> of a DataFrame. I often use it when debugging pipelines, to check the impact
> on the RDD after changing code.
>
> I've noticed that not all datatypes are in this summary. Currently, there
> is a list
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
> of types that are allowed to be included in the summary, and I'd love to
> extend that list.
>
> The first one is the date type. It is important to define this together
> with the community and to reach consensus, as this will be part of the
> public API; changing it later will be costly (or maybe impossible).
>
> I've checked how other DBMSes handle averages of dates:
>
> Postgres
>
> Unsupported:
>
> postgres@366ecc8a0fb9:/$ psql
> psql (12.3 (Debian 12.3-1.pgdg100+1))
> Type "help" for help.
>
> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
> ERROR:  cannot cast type date to numeric
> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>                ^
>
> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
> ERROR:  cannot cast type date to integer
> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>                ^
>
> The way to get the epoch in days is:
>
> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
> date_part
> -----------
>     18422
> (1 row)
>
>
> MySQL
>
> Converts to a YYYYMMDD format:
>
> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
> +---------------------------------------------+
> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
> +---------------------------------------------+
> |                                    20200101 |
> +---------------------------------------------+
> 1 row in set (0.00 sec)
>
> However, converting to an int isn't allowed:
>
> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
> corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>
> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
> corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
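>
> Note that even though MySQL produces a number here, averaging that YYYYMMDD
> encoding would give nonsense. A hypothetical example (not from the session
> above):
>
> SELECT AVG(CAST(CAST(d AS DATE) AS decimal)) AS avg_yyyymmdd
> FROM (SELECT '2019-12-31' AS d UNION ALL SELECT '2020-01-01') t;
> -- (20191231 + 20200101) / 2 = 20195666, i.e. year 2019, month 56, day 66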
>
> BigQuery
>
> Unsupported:
>
> [image: BigQuery error: casting DATE to a numeric type is not supported]
>
> Excel
>
> Converts it to a serial day number (days since Excel's own epoch). This feels
> weird, but I can see it, as a plain day count is used as the internal physical
> format in many data formats.
>
> [image: Excel displaying the date as a numeric serial day value]
>
> For me, returning a Date as the output of avg(date) seems like a logical
> choice. Internally it is handled as days since epoch, which makes sense as
> well:
>
> In *Avro* it is milliseconds since epoch:
>
> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>
> In *Parquet* it is days since epoch:
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>
> *ORC* is based around days since epoch:
>
> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>
> Also with this, we keep parity with the Catalyst type :)
>
> Any further thoughts on this before moving forward?
>
> Kind regards, Fokko
>
>
>
