spark-dev mailing list archives

From Greg Rahn <greg.r...@gmail.com>
Subject Re: Allow average out of a Date
Date Sun, 23 Aug 2020 19:36:40 GMT
I think your "working" timestamp example exposes the crux of the implicit
casting problem -- and PR#28554
<https://github.com/apache/spark/pull/28554> points
to one of the other prominent issues: function return types.  The current
implementation seems problematic to me: there should not be implicit
conversions of timestamp types to double, because there is no single
definition of how to represent a timestamp as a double.  It's fine to have
explicit casts or helper functions -- because then the user is consciously
choosing the "units" used for a given numerical representation.
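
As a minimal sketch of what that explicit choice could look like (using the
timestamps view from the example below), the user picks seconds since epoch
as the unit and converts back explicitly:

-- explicit round trip: timestamp -> seconds since epoch -> average -> timestamp
SELECT CAST(AVG(unix_timestamp(a)) AS timestamp) AS avg_ts
FROM timestamps;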

If the goal here is for the summary to work on dates, can't we just teach
that function how to do this using existing functionality, without exposing
the internal numeric representation of dates?
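
For example (a sketch; dates_table and column d are hypothetical): min,
mean, and max of a date column using only existing date functions, with no
implicit date-to-numeric cast (the mean truncates here, which is itself a
rounding choice):

SELECT MIN(d) AS min_date,
       -- average the day offsets from the epoch, then map back to a date
       DATE_ADD(DATE'1970-01-01',
                CAST(AVG(DATEDIFF(d, DATE'1970-01-01')) AS INT)) AS mean_date,
       MAX(d) AS max_date
FROM dates_table;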

Just to be clear -- I have no objection to the objective here -- I just
have concerns about the proposed implementation and its casting impacts.

On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <fokko@driesprong.frl>
wrote:

> Thanks, Greg, for your elaborate response and for sharing the table.
>
> The open PR is targeting SPARK-10520:
> https://github.com/apache/spark/pull/28554 Before opening a new ticket, I
> noticed that this was already an open issue. However, this will make it
> usable for Python/Java/Scala as well.
>
> Not being able to sum dates is still an open issue. For the current
> solution I've taken inspiration from the timestamp implementation in Spark:
>
> CREATE VIEW timestamps
>     AS SELECT explode(array(
>                   CAST('2020-07-01 12:00:00' AS timestamp),
>                   CAST('2020-07-02 18:00:00' AS timestamp)
>               )) AS a;
>
> SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;
>
> Sum:         3.1873032E9
> Sum / Count: 1.5936516E9
> Avg:         1.5936516E9
>
> Using Spark 2.4.5. The datatype is implicitly converted to a
> DoubleType (seconds since epoch), while I believe this should be an
> explicit choice. I've started the discussion
> <https://github.com/apache/spark/pull/28821> in a separate PR.
>
> I fully agree that summing/subtracting dates/timestamps don't look very
> sensible. However, taking an average can be of use. For example, if you're
> sessionising user activity or log entries, you might want to have a single
> date/timestamp to represent the session. Taking an average (or a median?)
> would make sense in such a case.
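>
> A minimal sketch of that (the events table and its columns are
> hypothetical): take an explicit round trip through seconds since epoch, so
> the unit stays a conscious choice:
>
> -- one representative timestamp per session
> SELECT session_id,
>        CAST(AVG(unix_timestamp(event_ts)) AS timestamp) AS session_ts
> FROM events
> GROUP BY session_id;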
>
> > As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
> AVG() works on DATE, then SUM() (and the plus operator) would need to work
> as well -- but what is the definition of SUM() for a list of DATEs or
> DATE + DATE? There isn't any such definition in the SQL world. And take a
> very simple example -- what's the AVG() of today and yesterday? If you
> represent these dates as their Julian integer value, the average of two
> consecutive integers is a floating point number between them -- so now
> what?  Should it be ceil() or floor()? Round down or round up? Should it be
> yesterday or today?  It seems very challenging to deal with this in a
> formal and standardized way.
>
> I would not bother the end-user with all these questions, as I expect
> this isn't an issue in most cases. If users want full control, they can
> first cast the date to a double, which gives days since epoch, and then
> handle any rounding themselves. The next step would be taking the std-dev
> of a Date column; I would suggest using an interval type to represent the
> deviation, but let's take one step at a time.
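>
> As a sketch of that control (dates_table and d are hypothetical; DATEDIFF
> makes the days-since-epoch representation explicit in today's Spark SQL),
> the user picks the rounding direction:
>
> SELECT DATE_ADD(DATE'1970-01-01',
>                 CAST(FLOOR(AVG(DATEDIFF(d, DATE'1970-01-01'))) AS INT)) AS avg_down,
>        DATE_ADD(DATE'1970-01-01',
>                 CAST(CEIL(AVG(DATEDIFF(d, DATE'1970-01-01'))) AS INT)) AS avg_up
> FROM dates_table;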
>
> > It's my understanding that it's this ambiguity, and the need for an
> intermediate numerical representation like the Julian Day Number, that
> limits this from being an implicit SQL data type conversion operation and
> why users are forced to choose their own explicit conversion (to Julian
> date, for example) in order to do such mathematical operations. As we see
> in the SQL:2016 standard cast specification and conversion table (below) --
> there is no standardized way to implicitly cast between DATE and numeric
> types because there is not just one numerical representation of a DATE.
>
> For now, I've decided to stick with days since the Unix epoch, as it seems
> to be popular among file formats:
>
> *Avro*: milliseconds since the Unix epoch:
> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
> *Parquet*: days since the Unix epoch:
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
> *ORC*: days since the Unix epoch:
> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
> *Spark*: days since the Unix epoch:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala#L36
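>
> That representation is already observable in current Spark SQL, e.g.:
>
> SELECT DATEDIFF(DATE'2020-07-01', DATE'1970-01-01') AS days_since_epoch;
> -- 18444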
>
> > If possible, this would seem to yield the desired behavior for the use
> case in question, but not broaden the requirement so wide that it becomes
> too challenging to implement and conflicts with other well defined
> standards, like ANSI SQL:2016.
>
> I see your point, and I don't like to reinvent the standards. But the ANSI
> standard is clearly too limited here, as you might also want to do things
> that you normally wouldn't do in a database (for example, feature
> generation or the above-mentioned sessionising). Looking at other RDBMSes,
> it is clear that there isn't a uniform way of doing this, as shown in my
> initial mail. For example, you can't sum MySQL's cast output (YYYYMMDD
> format), because you might get invalid months/days in the result
> (October + October = month 20?).
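>
> Concretely, in MySQL:
>
> SELECT CAST(CAST('2020-10-01' AS DATE) AS decimal)
>      + CAST(CAST('2020-10-02' AS DATE) AS decimal);
> -- 40402003: read back as YYYYMMDD, that's "month 20"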
>
> Currently, it isn't possible to sum dates; if this is something that you
> want to have in there, I can add it to the PR. For the .summary() function
> it isn't required. The workaround is to cast to a double first and then
> sum the days since epoch.
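>
> A sketch of that workaround under the casting behavior proposed in the PR
> (where CAST(date AS double) would yield days since epoch; dates_table and
> d are hypothetical):
>
> SELECT SUM(CAST(d AS double)) AS total_days_since_epoch
> FROM dates_table;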
>
> Kind regards, Fokko
>
>
> On Fri, Aug 21, 2020 at 18:13 Greg Rahn <greg.rahn@gmail.com> wrote:
>
>> Fokko-
>>
>> I can certainly understand the ask for this behavior, and it seems simple
>> and straightforward at face value, as R's Date class behaves this way.
>> However, I think there are several challenges involved here: AFAIK there
>> is no well-defined way SQL systems do this implicitly today -- and I think
>> there is a clear explanation of why.
>>
>> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>> as well -- but what is the definition of SUM() for a list of DATEs or
>> DATE + DATE? There isn't any such definition in the SQL world. And take
>> a very simple example -- what's the AVG() of today and yesterday? If you
>> represent these dates as their Julian integer value, the average of two
>> consecutive integers is a floating point number between them -- so now
>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>> yesterday or today?  It seems very challenging to deal with this in a
>> formal and standardized way.
>>
>> It's my understanding that it's this ambiguity, and the need for an
>> intermediate numerical representation like the Julian Day Number, that
>> limits this from being an implicit SQL data type conversion operation and
>> why users are forced to choose their own explicit conversion (to Julian
>> date, for example) in order to do such mathematical operations. As we see
>> in the SQL:2016 standard cast specification and conversion table (below) --
>> there is no standardized way to implicitly cast between DATE and numeric
>> types because there is not just one numerical representation of a DATE.
>>
>> If we look back at the use case from SPARK-10520, which is about R's
>> summary function on a vector of dates (and I don't claim to be an R
>> internals expert), maybe it makes more sense to implement R's Date class
>> behavior in SparkR rather than more generically across multiple languages
>> and types.  If possible, this would seem to yield the desired behavior for
>> the use case in question, but not broaden the requirement so wide that it
>> becomes too challenging to implement and conflicts with other well defined
>> standards, like ANSI SQL:2016.
>>
>>
>> [image: SQL:2016 cast specification and conversion table (Screen Shot 2020-08-21 at 12.00.01 PM.png)]
>>
>>
>> For reference:
>> https://github.com/apache/spark/pull/28754
>> https://issues.apache.org/jira/browse/SPARK-10520
>>
>>
>>
>>
>> On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fokko@driesprong.frl>
>> wrote:
>>
>>> Hi all,
>>>
>>> Personally, I'm a big fan of the .summary() function for computing
>>> statistics of a dataframe. I often use it for debugging pipelines, and to
>>> check the impact on the data after changing code.
>>>
>>> I've noticed that not all datatypes are in this summary. Currently,
>>> there is a list
>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
>>> of types that are allowed to be included in the summary, and I would love
>>> to extend that list.
>>>
>>> The first one is the date type. It is important to define this together
>>> with the community and to reach consensus, as this will be part of the
>>> public API; changing it later will be costly (or maybe impossible).
>>>
>>> I've checked what other DBMSes do with averages of dates:
>>>
>>> Postgres
>>>
>>> Unsupported:
>>>
>>> postgres@366ecc8a0fb9:/$ psql
>>> psql (12.3 (Debian 12.3-1.pgdg100+1))
>>> Type "help" for help.
>>>
>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>> ERROR:  cannot cast type date to numeric
>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>                ^
>>>
>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>> ERROR:  cannot cast type date to integer
>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>                ^
>>>
>>> The way to get the epoch in days is:
>>>
>>> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
>>> date_part
>>> -----------
>>>     18422
>>> (1 row)
>>>
>>>
>>> MySQL
>>>
>>> Converts to a YYYYMMDD format:
>>>
>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>> +---------------------------------------------+
>>> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
>>> +---------------------------------------------+
>>> |                                    20200101 |
>>> +---------------------------------------------+
>>> 1 row in set (0.00 sec)
>>>
>>> However, converting to an int isn't allowed:
>>>
>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
>>>
>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
>>>
>>> BigQuery
>>>
>>> Unsupported:
>>>
>>> [image: BigQuery error -- casting a DATE to a numeric type is not supported]
>>>
>>> Excel
>>>
>>> Converts it to a serial day number (Excel counts days from its own 1900
>>> epoch rather than the Unix epoch). This feels weird, but I can see it, as
>>> a days-since-epoch integer is used as the physical format internally in
>>> many data formats.
>>>
>>> [image: Excel showing the date converted to its serial day number]
>>>
>>> For me, returning a Date as the output of avg(date) seems like a logical
>>> choice. Internally a date is handled as days since epoch, which makes
>>> sense as well:
>>>
>>> *Avro*: milliseconds since epoch:
>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>>
>>> *Parquet*: days since epoch:
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>>
>>> *ORC*: days since epoch:
>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>>
>>> Also with this, we keep parity with the Catalyst type :)
>>>
>>> Any further thoughts on this before moving forward?
>>>
>>> Kind regards, Fokko
>>>
>>>
>>>
