spark-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: Allow average out of a Date
Date Mon, 24 Aug 2020 07:56:42 GMT
> +1, we don't need to complicate the type system and claim that
date/timestamp can do avg/sum/etc., we can just turn date/timestamp to
numeric, do whatever you want, and turn the result back to date/timestamp,
in the `summary` function.

+1 This is exactly my proposal in: [SPARK-10520][SQL] Allow average out of
DateType <https://github.com/apache/spark/pull/28754>

This will return a date out of avg(date), so the user isn't confused by
getting back an arbitrary number (in this case, days since epoch).

As a side note, I've illustrated the behaviour of avg(timestamp), which
returns the seconds since epoch, but should also return a timestamp in my
opinion: [SPARK-31981][SQL] Keep TimestampType when taking an average of a
Timestamp <https://github.com/apache/spark/pull/28821> However, changing
this means changing the public API :'( Feel free to chime in on the PR if
you're interested.

My proposal is to keep the input and output type the same.
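
To make the intended behaviour concrete, here is an illustrative sketch (the
view and values are made up; the AVG result assumes the patch is in):

CREATE OR REPLACE TEMPORARY VIEW dates
    AS SELECT explode(array(
                  CAST('2020-07-01' AS date),
                  CAST('2020-07-03' AS date)
              )) AS a;

SELECT AVG(a) FROM dates;

-- proposed result: 2020-07-02 (a date, not 18445.0)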

On Mon, Aug 24, 2020 at 5:04 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> > can't we just teach that function how to do this using existing
> functionality w/o exposing all this internal numeric representation for
> dates?
>
> +1, we don't need to complicate the type system and claim that
> date/timestamp can do avg/sum/etc., we can just turn date/timestamp to
> numeric, do whatever you want, and turn the result back to date/timestamp,
> in the `summary` function.
>
> On Mon, Aug 24, 2020 at 3:37 AM Greg Rahn <greg.rahn@gmail.com> wrote:
>
>> I think your "working" timestamp example exposes the crux of the implicit
>> casting problem -- and PR#28554
>> <https://github.com/apache/spark/pull/28554> points to one of the other
>> prominent issues -- function return types.  The current
>> implementation seems problematic to me: there should not be
>> implicit conversions of timestamp types to double, because there is no
>> single definition of how to represent a timestamp in double format.  It's
>> fine to have explicit casts or helper functions -- because then the user is
>> consciously choosing the "units" being used for a given numerical
>> representation.
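>>
>> For example (an illustrative Spark SQL sketch; `events` and `ts` are
>> made-up names), each query makes the units a conscious, explicit choice:
>>
>> SELECT CAST(ts AS double) AS epoch_seconds_fractional,
>>        unix_timestamp(ts) AS epoch_seconds_whole
>> FROM events;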
>>
>> If the goal here is for the summary to work on dates, can't we just teach
>> that function how to do this using existing functionality w/o exposing all
>> this internal numeric representation for dates?
>>
>> Just to be clear -- I have no objection to the objective here -- I just
>> have concerns about the proposed implementation and the casting impacts.
>>
>> On Sat, Aug 22, 2020 at 12:28 PM Driesprong, Fokko <fokko@driesprong.frl>
>> wrote:
>>
>>> Thanks, Greg, for your elaborate response and for sharing the table.
>>>
>>> The open PR is targeting SPARK-10520:
>>> https://github.com/apache/spark/pull/28554 Before opening a new ticket,
>>> I noticed that this was already an open issue. However, this will make it
>>> usable for Python/Java/Scala as well.
>>>
>>> Not being able to sum dates is still an open issue. For the current
>>> solution I've taken inspiration from the timestamp implementation in Spark:
>>>
>>> CREATE VIEW timestamps
>>>     AS SELECT explode(array(
>>>                   CAST('2020-07-01 12:00:00' AS timestamp),
>>>                   CAST('2020-07-02 18:00:00' AS timestamp)
>>>               )) AS a;
>>>
>>> SELECT SUM(a), SUM(a) / COUNT(a), AVG(a) FROM timestamps;
>>>
>>> Sum:         3.1873032E9
>>> Sum / Count: 1.5936516E9
>>> Avg:         1.5936516E9
>>>
>>> Using Spark 2.4.5. The datatype is implicitly converted to a
>>> DoubleType (seconds since epoch), while I believe this should be an
>>> explicit choice. I've started the discussion
>>> <https://github.com/apache/spark/pull/28821> in a separate PR.
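>>>
>>> Making that choice explicit is already possible, e.g. (a sketch against
>>> the timestamps view above; both casts exist today):
>>>
>>> SELECT CAST(AVG(CAST(a AS double)) AS timestamp) FROM timestamps;
>>>
>>> -- 2020-07-02 03:00:00, the midpoint, rendered in the session time zone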
>>>
>>> I fully agree that summing/subtracting dates/timestamps doesn't look very
>>> sensible. However, taking an average can be of use. For example, if you're
>>> sessionising user activity or log entries, you might want to have a single
>>> date/timestamp to represent the session. Taking an average (or a median?)
>>> would make sense in such a case.
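>>>
>>> For example (a sketch; `events`, `session_id` and `event_ts` are made-up
>>> names):
>>>
>>> SELECT session_id,
>>>        CAST(AVG(CAST(event_ts AS double)) AS timestamp) AS session_ts
>>> FROM events
>>> GROUP BY session_id;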
>>>
>>> > As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>>> as well -- but what is the definition of SUM() for a list of DATEs or
>>> DATE + DATE? There isn't any such definition in the SQL world. And take
>>> a very simple example -- what's the AVG() of today and yesterday? If you
>>> represent these dates as their Julian integer value, the average of two
>>> consecutive integers is a floating point number between them -- so now
>>> what?  Should it be ceil() or floor()? Round down or round up? Should it be
>>> yesterday or today?  It seems very challenging to deal with this in a
>>> formal and standardized way.
>>>
>>> I would not bother the end-user with all these questions, as I would
>>> expect that in most cases this isn't an issue. If users want full control,
>>> they can first cast the date to a double, which gives days since epoch;
>>> then they have full control over possible rounding etc. A next step would
>>> be taking the std-dev of a Date column; I would suggest using an interval
>>> type to represent the deviation, but let's take one step at a time.
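>>>
>>> For instance, the rounding choice can already be made explicit with
>>> existing functions (a sketch; `dates` is a made-up view with a DateType
>>> column `a`):
>>>
>>> SELECT date_add(CAST('1970-01-01' AS date),
>>>                 CAST(floor(AVG(datediff(a, CAST('1970-01-01' AS date)))) AS int)) AS round_down,
>>>        date_add(CAST('1970-01-01' AS date),
>>>                 CAST(ceil(AVG(datediff(a, CAST('1970-01-01' AS date)))) AS int)) AS round_up
>>> FROM dates;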
>>>
>>> > It's my understanding that it's this ambiguity, and the need for an
>>> intermediate numerical representation like the Julian Day Number, that
>>> limits this from being an implicit SQL data type conversion operation and
>>> why users are forced to choose their own explicit conversion (to Julian
>>> date, for example) in order to do such mathematical operations. As we see
>>> in the SQL:2016 standard cast specification and conversion table (below) --
>>> there is no standardized way to implicitly cast between DATE and numeric
>>> types because there is not just one numerical representation of a DATE.
>>>
>>> For now, I've decided to stick with the days since Unix epoch, as it
>>> seems to be popular among file formats:
>>>
>>> In *Avro* it is milliseconds since the Unix epoch:
>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>> In *Parquet* it is days since the Unix epoch:
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>> In *ORC* it is days since the Unix epoch:
>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>> In *Spark* it is days since the Unix epoch:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala#L36
>>>
>>> > If possible, this would seem to yield the desired behavior for the use
>>> case in question, but not broaden the requirement so wide that it becomes
>>> too challenging to implement and conflicts with other well defined
>>> standards, like ANSI SQL:2016.
>>>
>>> I see your point, and I don't like to reinvent the standards. But the
>>> ANSI standard is clearly too limited, as you might also want to do things
>>> that you normally wouldn't do in a database (for example, when it comes to
>>> feature generation or the above-mentioned sessionising). While looking at
>>> other RDBMSes, it is clear that there isn't a uniform way of doing this,
>>> as shown in my initial mail. For example, you won't be able to sum the cast
>>> output (YYYYMMDD format) of MySQL, because you might get invalid
>>> months/days in the output (October + October = month 20? e.g. 20201015 +
>>> 20201015 = 40402030).
>>>
>>> Currently, it isn't possible to sum dates; if this is something that you
>>> want to have in there, I can add it to the PR. For the .summary() function
>>> it isn't required. The workaround is to first cast the date to a double and
>>> then sum the days since epoch, as sketched below.
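>>>
>>> As a sketch (the first query assumes the cast from the PR; the second
>>> uses only functions that exist today; `dates` is a made-up view with a
>>> DateType column `a`):
>>>
>>> SELECT SUM(CAST(a AS double)) FROM dates;
>>> SELECT SUM(datediff(a, CAST('1970-01-01' AS date))) FROM dates;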
>>>
>>> Kind regards, Fokko
>>>
>>>
>>> On Fri, Aug 21, 2020 at 6:13 PM Greg Rahn <greg.rahn@gmail.com> wrote:
>>>
>>>> Fokko-
>>>>
>>>> I can certainly understand the ask for this behavior, and it seems
>>>> simple and/or straightforward at face value -- R's Date Class behaves
>>>> this way. However, I think there are several challenges involved here, as
>>>> AFAIK there seems to be no well-defined way SQL systems do this implicitly
>>>> today -- and I think there is a clear explanation of why.
>>>>
>>>> As you might know, AVG() is the equivalent of SUM()/COUNT() -- so if
>>>> AVG() works on DATE, then SUM() (and the plus operator) would need to work
>>>> as well -- but what is the definition of SUM() for a list of DATEs or
>>>> DATE + DATE? There isn't any such definition in the SQL world. And
>>>> take a very simple example -- what's the AVG() of today and yesterday? If
>>>> you represent these dates as their Julian integer value, the average of two
>>>> consecutive integers is a floating point number between them -- so now
>>>> what?  Should it be ceil() or floor()? Round down or round up? Should it
>>>> be yesterday or today?  It seems very challenging to deal with this in a
>>>> formal and standardized way.
>>>>
>>>> It's my understanding that it's this ambiguity, and the need for an
>>>> intermediate numerical representation like the Julian Day Number, that
>>>> limits this from being an implicit SQL data type conversion operation and
>>>> why users are forced to choose their own explicit conversion (to Julian
>>>> date, for example) in order to do such mathematical operations. As we see
>>>> in the SQL:2016 standard cast specification and conversion table (below) --
>>>> there is no standardized way to implicitly cast between DATE and numeric
>>>> types because there is not just one numerical representation of a DATE.
>>>>
>>>> If we look back at the use case from SPARK-10520, which is about R's
>>>> summary function on a vector of dates (and I don't claim to be an R
>>>> internals expert), maybe it makes more sense to implement R's Date Class
>>>> behavior in SparkR vs more generically across multiple languages and
>>>> types.  If possible, this would seem to yield the desired behavior for the
>>>> use case in question, but not broaden the requirement so wide that it
>>>> becomes too challenging to implement and conflicts with other well defined
>>>> standards, like ANSI SQL:2016.
>>>>
>>>>
>>>> [image: SQL:2016 standard cast specification and conversion table]
>>>>
>>>>
>>>> For reference:
>>>> https://github.com/apache/spark/pull/28754
>>>> https://issues.apache.org/jira/browse/SPARK-10520
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Aug 19, 2020 at 9:34 PM Driesprong, Fokko <fokko@driesprong.frl>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Personally, I'm a big fan of the .summary() function to compute
>>>>> statistics of a dataframe. I often use this for debugging pipelines, and
>>>>> to check the impact on the RDD after changing code.
>>>>>
>>>>> I've noticed that not all datatypes are in this summary. Currently,
>>>>> there is a list
>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
>>>>> of types that are allowed to be included in the summary, and I'd love to
>>>>> extend that list.
>>>>>
>>>>> The first one is the date type. It is important to define this
>>>>> together with the community, and that we get consensus, as this will be
>>>>> part of the public API. Changing this will be costly (or maybe impossible)
>>>>> to do.
>>>>>
>>>>> I've checked what other DBMSes do with averages out of dates:
>>>>>
>>>>> Postgres
>>>>>
>>>>> Unsupported:
>>>>>
>>>>> postgres@366ecc8a0fb9:/$ psql
>>>>> psql (12.3 (Debian 12.3-1.pgdg100+1))
>>>>> Type "help" for help.
>>>>>
>>>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>> ERROR:  cannot cast type date to numeric
>>>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>>                ^
>>>>>
>>>>> postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>>> ERROR:  cannot cast type date to integer
>>>>> LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
>>>>>                ^
>>>>>
>>>>> The way to get the epoch in days is:
>>>>>
>>>>> postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
>>>>> date_part
>>>>> -----------
>>>>>     18422
>>>>> (1 row)
>>>>>
>>>>>
>>>>> MySQL
>>>>>
>>>>> Converts to a YYYYMMDD format:
>>>>>
>>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
>>>>> +---------------------------------------------+
>>>>> | CAST(CAST('2020-01-01' AS DATE) AS decimal) |
>>>>> +---------------------------------------------+
>>>>> |                                    20200101 |
>>>>> +---------------------------------------------+
>>>>> 1 row in set (0.00 sec)
>>>>>
>>>>> However, converting to an int isn't allowed:
>>>>>
>>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
>>>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual
>>>>> that corresponds to your MySQL server version for the right syntax to use
>>>>> near 'int)' at line 1
>>>>>
>>>>> mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
>>>>> ERROR 1064 (42000): You have an error in your SQL syntax; check the manual
>>>>> that corresponds to your MySQL server version for the right syntax to use
>>>>> near 'bigint)' at line 1
>>>>>
>>>>> BigQuery
>>>>>
>>>>> Unsupported:
>>>>>
>>>>> [image: screenshot of BigQuery rejecting a cast from DATE to a numeric type]
>>>>>
>>>>> Excel
>>>>>
>>>>> Converts it to a day number (days since its own epoch). This feels weird,
>>>>> but I can see it, as a days-since-epoch representation is used internally
>>>>> as the physical format in many data formats.
>>>>>
>>>>> [image: screenshot of Excel showing a date converted to a day number]
>>>>>
>>>>> For me, returning a Date as the output of avg(date) seems like a
>>>>> logical choice. Internally it is handled as days since epoch, which
>>>>> makes sense as well:
>>>>>
>>>>> In *Avro* it is milliseconds since the epoch:
>>>>>
>>>>> https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
>>>>>
>>>>> In *Parquet* it is days since the epoch:
>>>>>
>>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
>>>>>
>>>>> *ORC* is based around days since the epoch:
>>>>>
>>>>> https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
>>>>>
>>>>> Also with this, we keep parity with the Catalyst type :)
>>>>>
>>>>> Any further thoughts on this before moving forward?
>>>>>
>>>>> Kind regards, Fokko
>>>>>
>>>>>
>>>>>
