spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY
Date Thu, 03 Nov 2016 13:29:05 GMT
I just realized you only want to keep the first element of each group. You
can do this without sorting, by doing something similar to a min or max
operation with a custom aggregator/UDAF, or with reduceGroups on a Dataset.
This is also more efficient.
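
A minimal sketch of the reduceGroups approach, assuming Spark 2.x (where
Dataset.groupByKey and reduceGroups are available; the 1.6 Dataset API is
different) and a hypothetical case class Txn whose fields mirror the columns
in the question. The object and method names are illustrative only.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row type; field names follow the columns in the question.
case class Txn(mobileno: String,
               customername: String,
               service_type: String,
               cust_addr: String,
               transaction_date: Timestamp)

object LatestPerGroup {
  // Keep, for each mobileno, the row with the latest transaction_date.
  // No global sort: each group is reduced pairwise, like a max operation.
  def latestPerKey(ds: Dataset[Txn]): Dataset[Txn] = {
    import ds.sparkSession.implicits._
    ds.groupByKey(_.mobileno)
      .reduceGroups((a, b) =>
        if (a.transaction_date.after(b.transaction_date)) a else b)
      .map(_._2) // drop the key, keep the winning row
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-group")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val ds = Seq(
      Txn("111", "alice", "prepaid", "addr-old", Timestamp.valueOf("2016-01-01 00:00:00")),
      Txn("111", "alice", "postpaid", "addr-new", Timestamp.valueOf("2016-06-01 00:00:00")),
      Txn("222", "bob", "prepaid", "addr-bob", Timestamp.valueOf("2016-03-01 00:00:00"))
    ).toDS()

    latestPerKey(ds).show()
    spark.stop()
  }
}
```

Because the result of each group depends only on comparing timestamps, it
does not rely on any row ordering surviving the shuffle. If two rows in a
group share the same timestamp, which one is kept is not defined here.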

On Nov 3, 2016 7:53 AM, "Rabin Banerjee" <dev.rabin.banerjee@gmail.com>
wrote:

> Hi all,
>
>   I want to do a DataFrame operation to find the rows having the latest
> timestamp in each group, using the operation below:
>
> df.orderBy(desc("transaction_date"))
>   .groupBy("mobileno")
>   .agg(first("customername").as("customername"), first("service_type").as("service_type"), first("cust_addr").as("cust_addr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
>
>
> *Spark Version :: 1.6.x*
>
> My question is: *"Will Spark guarantee the order while doing the groupBy, if the
> DF was previously ordered using orderBy, in Spark 1.6.x?"*
>
>
> *I referred to a blog post here:*
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>
> *It claims this will work, except in Spark 1.5.1 and 1.5.2.*
>
>
> *I need a bit of elaboration on how Spark handles this internally. Also, is it
> more efficient than using a Window function?*
>
>
> *Thanks in advance,*
>
> *Rabin Banerjee*
>
>
>
>
