spark-user mailing list archives

From Junaid Nasir <jna...@an10.io>
Subject Iterate over grouped df to create new rows/df
Date Fri, 07 Jul 2017 21:06:10 GMT
Hi everyone,

I am stuck on a problem and was hoping for some pointers or help :)
I have tried different approaches but couldn't achieve the desired result.

I want to *create a single row from multiple rows if those rows are
continuous* (based on time, i.e. if the next row's time is within 2 minutes
of the previous row's time).

So what I have is this df (after filtering and grouping):

+--------------------+---+-----+
|                time|val|group|
+--------------------+---+-----+
| 2017-01-01 00:00:00| 41|    1|
| 2017-01-01 00:01:00| 42|    1|
| 2017-01-01 00:02:00| 41|    1|
| 2017-01-01 00:15:00| 50|    1|
| 2017-01-01 00:18:00| 49|    1|
| 2017-01-01 00:19:00| 51|    1|
| 2017-01-01 00:20:00| 30|    1|
+--------------------+---+-----+

from which I want to compute another df:

+--------------------+--------------------+-----+
|          start time|            end time|group|
+--------------------+--------------------+-----+
| 2017-01-01 00:00:00| 2017-01-01 00:02:00|    1|
| 2017-01-01 00:15:00| 2017-01-01 00:15:00|    1|
| 2017-01-01 00:18:00| 2017-01-01 00:20:00|    1|
+--------------------+--------------------+-----+

How do I achieve this? A UDAF with withColumn only aggregates into a
single row.
I am using Spark 2.1.0 on Zeppelin with PySpark.
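One common way to express this is the "gaps and islands" pattern: over a window partitioned by group and ordered by time, use `lag("time")` to get each row's gap to the previous row, flag gaps over 2 minutes as session breaks, take a cumulative `sum` of the flags as a session id, then `groupBy` group and session id with `min(time)` and `max(time)`. The per-group logic that pattern implements can be sketched in plain Python (the `sessionize` helper below is a hypothetical name for illustration, not part of any Spark API):

```python
from datetime import datetime, timedelta

def sessionize(rows, gap=timedelta(minutes=2)):
    """Collapse time-sorted (time, val, group) rows into
    (start_time, end_time, group) spans, opening a new span whenever
    the next row is more than `gap` after the previous one."""
    spans = []
    prev_time = None
    for time, _val, group in rows:
        if spans and group == spans[-1][2] and time - prev_time <= gap:
            # continuous with the previous row: extend the span's end time
            spans[-1] = (spans[-1][0], time, group)
        else:
            # first row, a new group, or a gap over 2 minutes: open a new span
            spans.append((time, time, group))
        prev_time = time
    return spans
```

On the sample data above this yields one span per continuous run (the lone 00:15 row becomes a span whose start and end times are equal), which is the cumulative-sum session id trick expressed sequentially.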
