spark-issues mailing list archives

From "Apache Spark (Jira)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-31056) Add CalendarIntervals division
Date Fri, 01 May 2020 16:51:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-31056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Apache Spark reassigned SPARK-31056:
------------------------------------

    Assignee: Apache Spark

> Add CalendarIntervals division
> ------------------------------
>
>                 Key: SPARK-31056
>                 URL: https://issues.apache.org/jira/browse/SPARK-31056
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Enrico Minack
>            Assignee: Apache Spark
>            Priority: Major
>
> {{CalendarInterval}} should support division. A {{CalendarInterval}} consists of three time components: {{months}}, {{days}} and {{microseconds}}. Division is only defined between two intervals that each have exactly one non-zero time component, and it must be the same component in both; otherwise the division expression would be ambiguous.
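The single-non-zero-component rule above could be sketched as follows. This is a standalone sketch: the case class is a simplified stand-in for Spark's {{org.apache.spark.unsafe.types.CalendarInterval}}, and {{divide}} is a hypothetical helper, not Spark API.

```scala
// Simplified stand-in for org.apache.spark.unsafe.types.CalendarInterval,
// which stores months, days and microseconds separately.
case class CalendarInterval(months: Int, days: Int, microseconds: Long)

object IntervalDivision {
  // Hypothetical division: defined only when exactly one component is
  // non-zero across both intervals, and the divisor is non-zero there.
  def divide(a: CalendarInterval, b: CalendarInterval): Option[Double] = {
    val av = Seq(a.months.toDouble, a.days.toDouble, a.microseconds.toDouble)
    val bv = Seq(b.months.toDouble, b.days.toDouble, b.microseconds.toDouble)
    // keep only components where either interval is non-zero
    val nonZero = av.zip(bv).filter { case (x, y) => x != 0.0 || y != 0.0 }
    nonZero match {
      case Seq((x, y)) if y != 0.0 => Some(x / y) // unambiguous: same single component
      case _ => None // ambiguous (multiple components) or division by zero
    }
  }
}
```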
> This makes it possible to evaluate the magnitude of a {{CalendarInterval}} in SQL expressions:
> {code}
> Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
>   .toDF("start", "end")
>   .withColumn("interval", $"end" - $"start")
>   .withColumn("interval [h]", $"interval" / lit("1 hour").cast(CalendarIntervalType))
>   .withColumn("rate [€/h]", lit(1.45))
>   .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
>   .show(false)
> +-------------------+-------------------+-----------------------------+------------------+----------+------------------+
> |start              |end                |interval                     |interval [h]      |rate [€/h]|price [€]         |
> +-------------------+-------------------+-----------------------------+------------------+----------+------------------+
> |2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069444444444444|1.45      |2.1850694444444443|
> +-------------------+-------------------+-----------------------------+------------------+----------+------------------+
> {code}
> The currently available approach is
> {code}
> Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
>   .toDF("start", "end")
>   .withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
>   .withColumn("interval [h]", $"interval [s]" / 3600)
>   .withColumn("rate [€/h]", lit(1.45))
>   .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
>   .show(false)
> {code}
> Going through {{unix_timestamp}} is a hack and pollutes the SQL query with unrelated semantics (the unix timestamp is completely irrelevant to this computation). It is merely there because there is currently no way to access the length of a {{CalendarInterval}}. Dividing an interval by another interval provides a means to measure the length in an arbitrary unit (minutes, hours, quarter hours).
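The arithmetic behind that "arbitrary unit" claim can be shown in plain Scala (no Spark involved): an interval's length divided by a unit interval's length gives the magnitude in that unit. The values assume microsecond resolution, matching {{CalendarInterval}}'s {{microseconds}} field; {{magnitude}} is an illustrative helper, not Spark API.

```scala
object IntervalUnits {
  val MicrosPerSecond: Long = 1000000L

  // Hypothetical helper, not Spark API: length of an interval measured
  // in units of another interval, both given in microseconds.
  def magnitude(intervalMicros: Long, unitMicros: Long): Double =
    intervalMicros.toDouble / unitMicros.toDouble

  val hour: Long        = 3600L * MicrosPerSecond // "1 hour"
  val quarterHour: Long =  900L * MicrosPerSecond // "15 minutes"

  // 1 hours 30 minutes 25 seconds = 5425 seconds
  val interval: Long = 5425L * MicrosPerSecond
}
```

Dividing by the "1 hour" interval reproduces the 1.5069... value from the example above; dividing the same interval by "15 minutes" instead measures it in quarter hours.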



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

