Apologies, sent too early accidentally. Actual message is below
========================================================

I have a dataframe with two date columns (DateType), and I would like to add another column containing the difference between these two dates. A snippet of the dataframe is below.

new_df.show(5)
+-----------+----------+--------------+
| PATID| SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13| 2012-02-13|
|12345655545|2012-02-13| 2012-02-13|
|12345655545|2012-02-13|    2012-02-27|
+-----------+----------+--------------+


Here is what I have tried so far:

-> new_df.withColumn('SVCDATE2', (new_df.next_diag_date-new_df.SVCDATE)).show() 
    Error: DateType does not support numeric operations

-> new_df.withColumn('SVCDATE2', (new_df.next_diag_date-new_df.SVCDATE).days).show()
    Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
 
However, the equivalent plain-Python code runs fine in the PySpark shell:

from datetime import date
d0 = date(2008, 8, 18)
d1 = date(2008, 9, 26)
delta = d0 - d1
print(delta.days)

# -39

Any suggestions would be appreciated! Also, is there a way to add a new column to a dataframe without using a column expression, e.g. like in pandas or R: df$new_col = 'new col value'?
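To clarify the kind of one-liner I mean, in pandas a plain assignment adds the column (dummy data below, just for illustration):

```python
import pandas as pd

# The pandas behavior I'm hoping to mimic in Spark:
# assigning a scalar creates a new column, broadcast to every row.
pdf = pd.DataFrame({"PATID": ["12345655545", "12345655545"]})
pdf["new_col"] = "new col value"
```

My understanding is that Spark DataFrames are immutable, so something like withColumn may be the closest equivalent, but I'd like to confirm.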


Thanks,
Dhaval



On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1406@gmail.com> wrote:
new_df.withColumn('SVCDATE2', (new_df.next_diag_date-new_df.SVCDATE).days).show()

+-----------+----------+--------------+
|      PATID|   SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-27|
+-----------+----------+--------------+