spark-user mailing list archives

From Dhaval Patel <dhaval1...@gmail.com>
Subject Re: How to add a new column with date duration from 2 date columns in a dataframe
Date Thu, 20 Aug 2015 12:26:51 GMT
Apologies, I accidentally sent the previous message too early. The actual message is below.
========================================================

A dataframe has two date columns (DateType) and I would like to add
another column with the difference between these two dates.
A snippet of the dataframe is below.

new_df.show(5)
+-----------+----------+--------------+
|      PATID|   SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-27|
+-----------+----------+--------------+



Here is what I have tried so far:

-> new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE)).show()
    Error: DateType does not support numeric operations

-> new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE).days).show()
    Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);


However, the equivalent plain Python code works fine in the PySpark shell:

from datetime import date
d0 = date(2008, 8, 18)
d1 = date(2008, 9, 26)
delta = d0 - d1
print (d0 - d1).days

# -39
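The `.days` arithmetic above is per-row Python, so one likely route is to lift it into Spark as a UDF, or to use the built-in `datediff` from `pyspark.sql.functions` (added in Spark 1.5). A hedged sketch, not run against the data above:

```python
from datetime import date

# Per-row logic a Spark UDF would apply: number of days from
# `earlier` to `later` (negative if `later` precedes `earlier`).
def day_diff(later, earlier):
    return (later - earlier).days

print(day_diff(date(2012, 2, 27), date(2012, 2, 13)))  # 14

# In PySpark (sketch, assumes these functions are available; untested here):
#   from pyspark.sql.functions import udf, datediff
#   from pyspark.sql.types import IntegerType
#   diff = udf(lambda a, b: (a - b).days, IntegerType())
#   new_df.withColumn('SVCDATE2',
#                     diff(new_df.next_diag_date, new_df.SVCDATE)).show()
#   # Spark >= 1.5 has a built-in that avoids the UDF entirely:
#   new_df.withColumn('SVCDATE2',
#                     datediff(new_df.next_diag_date, new_df.SVCDATE)).show()
```

Either form should yield the difference in whole days as an integer column.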


Any suggestions would be appreciated! Also, is there a way to add a new
column to a dataframe without using a column expression, e.g. as in pandas
or R (df$new_col = 'new col value')?
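On that second question: Spark DataFrames are immutable, so there is no pandas/R-style in-place assignment; a new column always goes through `withColumn` with a Column expression. For a constant value, `lit()` from `pyspark.sql.functions` wraps the literal (a sketch under that assumption, not run here; the plain-Python rows below only mimic the effect):

```python
# PySpark sketch (assumes pyspark; untested here):
#   from pyspark.sql.functions import lit
#   new_df = new_df.withColumn('new_col', lit('new col value'))
#
# Conceptually this stamps the same constant onto every row, which we
# can mimic on plain Python rows:
rows = [{'PATID': '12345655545', 'SVCDATE': '2012-02-13'}]
rows = [dict(r, new_col='new col value') for r in rows]
print(rows[0]['new_col'])  # new col value
```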


Thanks,
Dhaval



On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1406@gmail.com> wrote:

> new_df.withColumn('SVCDATE2',
> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>
> +-----------+----------+--------------+
> |      PATID|   SVCDATE|next_diag_date|
> +-----------+----------+--------------+
> |12345655545|2012-02-13|    2012-02-13|
> |12345655545|2012-02-13|    2012-02-13|
> |12345655545|2012-02-13|    2012-02-27|
> +-----------+----------+--------------+
>
