spark-dev mailing list archives

From Enrico Minack <m...@Enrico.Minack.dev>
Subject Re: comparable and orderable CalendarInterval
Date Thu, 05 Mar 2020 10:36:34 GMT
There is another feature missing for CalendarInterval, related to 
comparability: measuring the length of an interval.

It would be nice if you could access the length of an interval; then you 
could compute something like this:

Seq((Timestamp.valueOf("2020-02-01 12:00:00"),
     Timestamp.valueOf("2020-02-01 13:30:25")))
  .toDF("start", "end")
  .withColumn("interval", $"end" - $"start")
  .withColumn("interval [h]", *INTERVAL LENGTH IN HOURS*)
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)

+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
|start              |end                |interval                     |interval [h]      |rate [€/h]|price [€]         |
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
|2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069444444444444|1.45      |2.1850694444444443|
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
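For reference, the numbers in that table follow from plain arithmetic; a minimal sketch without Spark, where the object and method names are made up for illustration:

```scala
// Sketch: reproduce the "interval [h]" and "price [€]" columns by hand.
// The interval "1 hours 30 minutes 25 seconds" is 5425 seconds; dividing
// by 3600 seconds per hour gives its length in hours.
object PriceFromInterval {
  def hours(totalSeconds: Long): Double = totalSeconds.toDouble / 3600.0

  def main(args: Array[String]): Unit = {
    val intervalSeconds = 1 * 3600L + 30 * 60L + 25L // 5425 s
    val h = hours(intervalSeconds)                   // ~ 1.5069444444444444
    val rate = 1.45                                  // €/h
    val price = h * rate                             // ~ 2.185069444444444
    println(s"$h $price")
  }
}
```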


The length of an interval can be measured by dividing it by the length 
of your measuring unit, e.g. "1 hour":

  $"interval" / lit("1 hour").cast(CalendarIntervalType)
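In the absence of such a built-in division, the same idea can be sketched on the microsecond component alone; `IntervalDivision` and its method are hypothetical names, and the month/day components are deliberately ignored because their length in microseconds is not fixed (which is exactly why comparability is tricky):

```scala
// Sketch of interval-by-interval division on the microsecond component
// only. A CalendarInterval with non-zero months or days has no single
// well-defined length, so this is restricted to pure time intervals.
object IntervalDivision {
  val MicrosPerHour: Long = 60L * 60L * 1000L * 1000L

  // hypothetical helper: interval / unit, both given in microseconds
  def divide(intervalMicros: Long, unitMicros: Long): Double =
    intervalMicros.toDouble / unitMicros

  def main(args: Array[String]): Unit = {
    val interval = 5425L * 1000L * 1000L // 1h 30m 25s in microseconds
    println(divide(interval, MicrosPerHour)) // length in hours
  }
}
```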


Which brings us to CalendarInterval division: 
https://github.com/apache/spark/pull/27805

Enrico


On 11.02.20 at 21:09, Enrico Minack wrote:
> I compute the difference of two timestamps and compare them with a 
> constant interval:
>
> Seq(("2019-01-02 12:00:00", "2019-01-02 13:30:00"))
>   .toDF("start", "end")
>   .select($"start".cast(TimestampType), $"end".cast(TimestampType))
>   .select($"start", $"end", ($"end" - $"start").as("diff"))
>   .where($"diff" < lit("INTERVAL 2 HOUR").cast(CalendarIntervalType))
>   .show
>
> Coming from timestamps, the interval should have correct hours 
> (millisecond component), so comparing it with the "right kinds of 
> intervals" should always be correct.
>
> Enrico
>
>
> On 11.02.20 at 17:06, Wenchen Fan wrote:
>> What's your use case to compare intervals? It's tricky in Spark as 
>> there is only one interval type and you can't really compare one 
>> month with 30 days.
>>
>> On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack 
>> <mail@enrico.minack.dev <mailto:mail@enrico.minack.dev>> wrote:
>>
>>     Hi Devs,
>>
>>     I would like to know what is the current roadmap of making
>>     CalendarInterval comparable and orderable again (SPARK-29679,
>>     SPARK-29385, #26337).
>>
>>     With #27262, this got reverted but SPARK-30551 does not mention
>>     how to
>>     go forward in this matter. I have found SPARK-28494, but this
>>     seems to
>>     be stale.
>>
>>     While I find it useful to compare such intervals, I cannot find a
>>     way to
>>     work around the missing comparability. Is there a way to get,
>>     e.g. the
>>     seconds that an interval represents to be able to compare
>>     intervals? In
>>     org.apache.spark.sql.catalyst.util.IntervalUtils there are
>>     methods like
>>     getEpoch or getDuration, which I cannot see are exposed to SQL or
>>     in the
>>     org.apache.spark.sql.functions package.
>>
>>     Thanks for the insights,
>>     Enrico
>>
>>
>>     ---------------------------------------------------------------------
>>     To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>     <mailto:dev-unsubscribe@spark.apache.org>
>>
>
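For the use case quoted above, one workaround is to compare the raw timestamp difference instead of a CalendarInterval; a sketch using plain java.sql.Timestamp, outside of Spark, with a made-up helper name:

```scala
import java.sql.Timestamp

// Workaround sketch: compare the millisecond difference of two
// timestamps against a threshold, avoiding CalendarInterval entirely.
// This is safe here because the difference of two timestamps has no
// ambiguous month/day component.
object DiffBelowThreshold {
  def diffBelowHours(start: Timestamp, end: Timestamp, hours: Long): Boolean =
    end.getTime - start.getTime < hours * 60L * 60L * 1000L

  def main(args: Array[String]): Unit = {
    val start = Timestamp.valueOf("2019-01-02 12:00:00")
    val end = Timestamp.valueOf("2019-01-02 13:30:00")
    println(diffBelowHours(start, end, 2)) // 1.5 h is below 2 h
  }
}
```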

