spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aram Mkrtchyan <aram.mkrtchyan...@gmail.com>
Subject Re: How to add a new column with date duration from 2 date columns in a dataframe
Date Thu, 20 Aug 2015 14:45:38 GMT
Hi,

hope this will help you

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._
    import java.sql.Timestamp

    val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")

    val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
          Days.daysBetween(new DateTime(value2.getTime), new
DateTime(value1.getTime)).getDays)
    df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()

or you can write sql query using hiveql's datediff function.
     https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel <dhaval1406@gmail.com> wrote:

> More update on this question......I am using spark 1.4.1.
>
> I was just reading documentation of spark 1.5 (still in development) and I
> think there will be a new func *datediff* that will solve the issue. So
> please let me know if there is any work-around until spark 1.5 is out :).
>
> pyspark.sql.functions.datediff(*end*, *start*)[source]
> <http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/_modules/pyspark/sql/functions.html#datediff>
> <http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff>
>
> Returns the number of days from start to end.
>
> >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])>>>
df.select(datediff(df.d2, df.d1).alias('diff')).collect()[Row(diff=32)]
>
> New in version 1.5.
>
> On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1406@gmail.com>
> wrote:
>
>> Apologies, sent too early accidentally. Actual message is below
>> ========================================================
>>
>> A dataframe has 2 datecolumns (datetime type) and I would like to add
>> another column that would have difference between these two dates.
>> Dataframe snippet is below.
>>
>> new_df.show(5)
>> +-----------+----------+--------------+
>> | PATID| SVCDATE|next_diag_date|
>> +-----------+----------+--------------+
>> |12345655545|2012-02-13| 2012-02-13|
>> |12345655545|2012-02-13| 2012-02-13|
>> |12345655545|2012-02-13| 2012-02-27|
>> +-----------+----------+--------------+
>>
>>
>>
>> Here is what I have tried so far:
>>
>> -> new_df.withColumn('SVCDATE2',
>> (new_df.next_diag_date-new_df.SVCDATE)).show()
>>     Error: DateType does not support numeric operations
>>
>> -> new_df.withColumn('SVCDATE2',
>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>>     Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
>>
>>
>> However this simple python code works fine with pySpark:
>>
>> from datetime import date
>> d0 = date(2008, 8, 18)
>> d1 = date(2008, 9, 26)
>> delta = d0 - d1
>> print (d0 - d1).days
>>
>> # -39
>>
>>
>> Any suggestions would be appreciated! Also is there a way to add a new
>> column in dataframe without using column expression (e.g. like in pandas or
>> R. df$new_col = 'new col value')?
>>
>>
>> Thanks,
>> Dhaval
>>
>>
>>
>> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1406@gmail.com>
>> wrote:
>>
>>> new_df.withColumn('SVCDATE2',
>>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
>>>
>>> +-----------+----------+--------------+ | PATID| SVCDATE|next_diag_date|
>>> +-----------+----------+--------------+ |12345655545|2012-02-13|
>>> 2012-02-13| |12345655545|2012-02-13| 2012-02-13| |12345655545|2012-02-13|
>>> 2012-02-27| +-----------+----------+--------------+
>>>
>>
>>
>

Mime
View raw message