spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Davies Liu <dav...@databricks.com>
Subject Re: How to add a new column with date duration from 2 date columns in a dataframe
Date Thu, 20 Aug 2015 19:02:04 GMT
As Aram said, there two options in Spark 1.4,

1) Use the HiveContext, then you got datediff from Hive,
df.selectExpr("datediff(d2, d1)")
2) Use Python UDF:
```
>>> from datetime import date
>>> df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))], ['d1',
'd2'])
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> diff = udf(lambda a, b: (a - b).days, IntegerType())
>>> df.select(diff(df.d1, df.d2)).show()
+-------------------------+
|PythonUDF#<lambda>(d1,d2)|
+-------------------------+
|                      -39|
+-------------------------+
```

On Thu, Aug 20, 2015 at 7:45 AM, Aram Mkrtchyan
<aram.mkrtchyan.87@gmail.com> wrote:
> Hi,
>
> hope this will help you
>
>     import org.apache.spark.sql.functions._
>     import sqlContext.implicits._
>     import java.sql.Timestamp
>
>     val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")
>
>     val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
>           Days.daysBetween(new DateTime(value2.getTime), new
> DateTime(value1.getTime)).getDays)
>     df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()
>
> or you can write sql query using hiveql's datediff function.
>      https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>
> On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel <dhaval1406@gmail.com> wrote:
>>
>> More update on this question......I am using spark 1.4.1.
>>
>> I was just reading documentation of spark 1.5 (still in development) and I
>> think there will be a new func *datediff* that will solve the issue. So
>> please let me know if there is any work-around until spark 1.5 is out :).
>>
>> pyspark.sql.functions.datediff(end, start)[source]
>>
>> Returns the number of days from start to end.
>>
>> >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1',
>> >>> 'd2'])
>> >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
>> [Row(diff=32)]
>>
>> New in version 1.5.
>>
>>
>> On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1406@gmail.com>
>> wrote:
>>>
>>> Apologies, sent too early accidentally. Actual message is below
>>> ========================================================
>>>
>>> A dataframe has 2 datecolumns (datetime type) and I would like to add
>>> another column that would have difference between these two dates. Dataframe
>>> snippet is below.
>>>
>>> new_df.show(5)
>>> +-----------+----------+--------------+
>>> | PATID| SVCDATE|next_diag_date|
>>> +-----------+----------+--------------+
>>> |12345655545|2012-02-13| 2012-02-13|
>>> |12345655545|2012-02-13| 2012-02-13|
>>> |12345655545|2012-02-13| 2012-02-27|
>>> +-----------+----------+--------------+
>>>
>>>
>>>
>>> Here is what I have tried so far:
>>>
>>> -> new_df.withColumn('SVCDATE2',
>>> (new_df.next_diag_date-new_df.SVCDATE)).show()
>>>     Error: DateType does not support numeric operations
>>>
>>> -> new_df.withColumn('SVCDATE2',
>>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>>>     Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
>>>
>>>
>>> However this simple python code works fine with pySpark:
>>>
>>> from datetime import date
>>> d0 = date(2008, 8, 18)
>>> d1 = date(2008, 9, 26)
>>> delta = d0 - d1
>>> print (d0 - d1).days
>>>
>>> # -39
>>>
>>>
>>> Any suggestions would be appreciated! Also is there a way to add a new
>>> column in dataframe without using column expression (e.g. like in pandas or
>>> R. df$new_col = 'new col value')?
>>>
>>>
>>> Thanks,
>>> Dhaval
>>>
>>>
>>>
>>> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1406@gmail.com>
>>> wrote:
>>>>
>>>> new_df.withColumn('SVCDATE2',
>>>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>>>>
>>>> +-----------+----------+--------------+ | PATID| SVCDATE|next_diag_date|
>>>> +-----------+----------+--------------+ |12345655545|2012-02-13| 2012-02-13|
>>>> |12345655545|2012-02-13| 2012-02-13| |12345655545|2012-02-13| 2012-02-27|
>>>> +-----------+----------+--------------+
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message