spark-user mailing list archives

From Dhaval Patel <dhaval1...@gmail.com>
Subject Re: How to add a new column with date duration from 2 date columns in a dataframe
Date Wed, 26 Aug 2015 15:29:09 GMT
Thanks Davies. HiveContext seems neat to use :)

On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu <davies@databricks.com> wrote:

> As Aram said, there are two options in Spark 1.4:
>
> 1) Use the HiveContext, then you get datediff from Hive:
> df.selectExpr("datediff(d2, d1)")
> 2) Use Python UDF:
> ```
> >>> from datetime import date
> >>> df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))], ['d1', 'd2'])
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> diff = udf(lambda a, b: (a - b).days, IntegerType())
> >>> df.select(diff(df.d1, df.d2)).show()
> +-------------------------+
> |PythonUDF#<lambda>(d1,d2)|
> +-------------------------+
> |                      -39|
> +-------------------------+
> ```
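[A quick plain-Python sanity check of what the UDF's lambda computes, no Spark needed: subtracting two `datetime.date` objects yields a `timedelta`, and its `.days` attribute is the signed day difference.]

```python
from datetime import date

# Same logic as the UDF body above: (a - b).days gives the signed
# number of days between two dates.
d1 = date(2008, 8, 18)
d2 = date(2008, 9, 26)

diff = (d1 - d2).days
print(diff)  # -39, matching the Spark output above
```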
>
> On Thu, Aug 20, 2015 at 7:45 AM, Aram Mkrtchyan
> <aram.mkrtchyan.87@gmail.com> wrote:
> > Hi,
> >
> > hope this will help you
> >
> >     import org.apache.spark.sql.functions._
> >     import sqlContext.implicits._
> >     import java.sql.Timestamp
> >     import org.joda.time.{DateTime, Days}
> >
> >     val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")
> >
> >     val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
> >       Days.daysBetween(new DateTime(value2.getTime), new DateTime(value1.getTime)).getDays)
> >     df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()
> >
> > or you can write a SQL query using HiveQL's datediff function.
> >      https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> >
> > On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel <dhaval1406@gmail.com>
> wrote:
> >>
> >> More updates on this question... I am using Spark 1.4.1.
> >>
> >> I was just reading the documentation of Spark 1.5 (still in development)
> >> and I think there will be a new function *datediff* that will solve the
> >> issue. So please let me know if there is any work-around until Spark 1.5
> >> is out :).
> >>
> >> pyspark.sql.functions.datediff(end, start)[source]
> >>
> >> Returns the number of days from start to end.
> >>
> >> >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
> >> >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
> >> [Row(diff=32)]
> >>
> >> New in version 1.5.
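[The semantics in the quoted 1.5 docs, datediff(end, start) = end - start in days, can be checked with plain-Python date arithmetic:]

```python
from datetime import date

# datediff(end, start) returns the number of days from start to end;
# the equivalent arithmetic in the standard library:
start = date(2015, 4, 8)
end = date(2015, 5, 10)
print((end - start).days)  # 32, matching Row(diff=32) above
```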
> >>
> >>
> >> On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1406@gmail.com>
> >> wrote:
> >>>
> >>> Apologies, sent too early accidentally. Actual message is below
> >>> ========================================================
> >>>
> >>> A dataframe has 2 date columns (datetime type) and I would like to add
> >>> another column that would have the difference between these two dates.
> >>> Dataframe snippet is below.
> >>>
> >>> new_df.show(5)
> >>> +-----------+----------+--------------+
> >>> | PATID| SVCDATE|next_diag_date|
> >>> +-----------+----------+--------------+
> >>> |12345655545|2012-02-13| 2012-02-13|
> >>> |12345655545|2012-02-13| 2012-02-13|
> >>> |12345655545|2012-02-13| 2012-02-27|
> >>> +-----------+----------+--------------+
> >>>
> >>>
> >>>
> >>> Here is what I have tried so far:
> >>>
> >>> -> new_df.withColumn('SVCDATE2',
> >>> (new_df.next_diag_date-new_df.SVCDATE)).show()
> >>>     Error: DateType does not support numeric operations
> >>>
> >>> -> new_df.withColumn('SVCDATE2',
> >>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
> >>>     Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
> >>>
> >>>
> >>> However, this simple Python code works fine with PySpark:
> >>>
> >>> from datetime import date
> >>> d0 = date(2008, 8, 18)
> >>> d1 = date(2008, 9, 26)
> >>> delta = d0 - d1
> >>> print (d0 - d1).days
> >>>
> >>> # -39
> >>>
> >>>
> >>> Any suggestions would be appreciated! Also, is there a way to add a new
> >>> column in a dataframe without using a column expression (e.g. like in
> >>> pandas or R: df$new_col = 'new col value')?
> >>>
> >>>
> >>> Thanks,
> >>> Dhaval
> >>>
> >>>
> >>>
> >>> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1406@gmail.com>
> >>> wrote:
> >>>>
> >>>> new_df.withColumn('SVCDATE2',
> >>>> (new_df.next_diag_date-new_df.SVCDATE).days).show()
> >>>>
> >>>> +-----------+----------+--------------+
> >>>> |      PATID|   SVCDATE|next_diag_date|
> >>>> +-----------+----------+--------------+
> >>>> |12345655545|2012-02-13|    2012-02-13|
> >>>> |12345655545|2012-02-13|    2012-02-13|
> >>>> |12345655545|2012-02-13|    2012-02-27|
> >>>> +-----------+----------+--------------+
> >>>
> >>>
> >>
> >
>
