spark-user mailing list archives

From Dhaval Patel <dhaval1...@gmail.com>
Subject Re: How to add a new column with date duration from 2 date columns in a dataframe
Date Thu, 20 Aug 2015 12:57:55 GMT
A further update on this question: I am using Spark 1.4.1.

I was just reading the documentation for Spark 1.5 (still in development), and
it looks like a new function, *datediff*, will solve this issue. In the
meantime, please let me know if there is any workaround until Spark 1.5 is out :).

pyspark.sql.functions.datediff(*end*, *start*)[source]
<http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/_modules/pyspark/sql/functions.html#datediff>
<http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff>

Returns the number of days from start to end.

>>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
>>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=32)]

New in version 1.5.
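Until 1.5 ships, a common workaround on 1.4.x is a Python UDF that does the
subtraction with plain `datetime` arithmetic, since DateType values reach a
Python UDF as `datetime.date` objects. A minimal sketch (the UDF wiring is
commented out because it needs a live SparkContext, and the column names are
assumed from the snippet quoted below, so treat it as untested):

```python
from datetime import date

# Plain-Python day difference; DateType values arrive in a Python UDF
# as datetime.date objects, so this function works unchanged in Spark.
def days_between(end, start):
    return (end - start).days

# Same arithmetic as the datediff doctest above:
print(days_between(date(2015, 5, 10), date(2015, 4, 8)))  # 32

# Wrapped as a UDF (assumed column names 'next_diag_date' and 'SVCDATE'
# from the snippet below; requires a running SparkContext):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import IntegerType
# diff_udf = udf(days_between, IntegerType())
# new_df = new_df.withColumn(
#     'date_diff', diff_udf(new_df.next_diag_date, new_df.SVCDATE))
```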

On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1406@gmail.com> wrote:

> Apologies, I sent that too early by accident. The actual message is below.
> ========================================================
>
> A dataframe has two date columns (DateType), and I would like to add
> another column holding the difference between those two dates. A
> snippet of the dataframe is below.
>
> new_df.show(5)
> +-----------+----------+--------------+
> | PATID| SVCDATE|next_diag_date|
> +-----------+----------+--------------+
> |12345655545|2012-02-13| 2012-02-13|
> |12345655545|2012-02-13| 2012-02-13|
> |12345655545|2012-02-13| 2012-02-27|
> +-----------+----------+--------------+
>
>
>
> Here is what I have tried so far:
>
> -> new_df.withColumn('SVCDATE2',
> (new_df.next_diag_date-new_df.SVCDATE)).show()
>     Error: DateType does not support numeric operations
>
> -> new_df.withColumn('SVCDATE2',
> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>     Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
>
>
> However, this simple Python code works fine in PySpark:
>
> from datetime import date
> d0 = date(2008, 8, 18)
> d1 = date(2008, 9, 26)
> delta = d0 - d1
> print (d0 - d1).days
>
> # -39
>
>
> Any suggestions would be appreciated! Also, is there a way to add a new
> column to a dataframe without using a column expression (e.g. as in pandas
> or R: df$new_col = 'new col value')?
>
>
> Thanks,
> Dhaval
>
>
>
> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1406@gmail.com>
> wrote:
>
>> new_df.withColumn('SVCDATE2',
>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>>
>> +-----------+----------+--------------+
>> |      PATID|   SVCDATE|next_diag_date|
>> +-----------+----------+--------------+
>> |12345655545|2012-02-13|    2012-02-13|
>> |12345655545|2012-02-13|    2012-02-13|
>> |12345655545|2012-02-13|    2012-02-27|
>> +-----------+----------+--------------+
>>
>
>
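On the side question quoted above: Spark DataFrames are immutable, so there is
no in-place assignment like pandas' `df['new_col'] = ...` or R's
`df$new_col <- ...`; `withColumn` with a literal is the closest idiom, and it
returns a new dataframe. A minimal sketch contrasting the two styles (the Spark
lines are commented out because they need a running SparkContext):

```python
import pandas as pd

# pandas / R style: mutate the frame in place with a plain assignment.
pdf = pd.DataFrame({'PATID': [12345655545, 12345655545]})
pdf['new_col'] = 'new col value'
print(pdf)

# Closest Spark idiom: withColumn + lit returns a *new* dataframe.
# (Assumes PySpark 1.3+; needs a live SparkContext, so not run here.)
# from pyspark.sql.functions import lit
# df2 = new_df.withColumn('new_col', lit('new col value'))
```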
