spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY
Date Thu, 03 Nov 2016 14:37:20 GMT
I did not check the claim in that blog post that the data is ordered, but I
wouldn't rely on that behavior, since it is not something the API guarantees
and it could change in future versions.
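
The reduce-based approach suggested in the quoted reply below (keep one row per key with a min/max-style reduction instead of sorting) might look roughly like the following. This is only a sketch, assuming Spark 2.x, where `groupByKey` and `reduceGroups` are available on `Dataset`; it will not compile as written on 1.6.x. The `Txn` case class, the `later` helper, the object name, and the sample rows are hypothetical and exist only for illustration. The column names are taken from the original question.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Hypothetical record type; field names mirror the columns in the question.
case class Txn(mobileno: String, customername: String, service_type: String,
               cust_addr: String, transaction_date: Timestamp)

object LatestPerGroup {
  // Keep whichever of the two records has the later transaction_date.
  // On a tie the first argument is kept.
  def later(a: Txn, b: Txn): Txn =
    if (b.transaction_date.after(a.transaction_date)) b else a

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("latest-per-group")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val ds = Seq(
      Txn("111", "alice", "prepaid", "old addr", Timestamp.valueOf("2016-01-01 00:00:00")),
      Txn("111", "alice", "postpaid", "new addr", Timestamp.valueOf("2016-06-01 00:00:00")),
      Txn("222", "bob", "prepaid", "some addr", Timestamp.valueOf("2016-03-01 00:00:00"))
    ).toDS()

    // One pass per group, no global sort: reduce each group to its latest row.
    val latest = ds
      .groupByKey(_.mobileno)
      .reduceGroups(later _)
      .map(_._2) // drop the key, keep the winning record

    latest.show()
    spark.stop()
  }
}
```

Because the result depends only on comparing `transaction_date` values, it does not rely on any row ordering surviving the shuffle, which is the behavior the API does not guarantee.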

On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee <dev.rabin.banerjee@gmail.com> wrote:

> Hi Koert & Robin,
>
> Thanks! But if you go through the blog
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> and check the comments under it, it is actually working, although I am not
> sure how. And yes, I agree a custom aggregate (UDAF) is a good option.
>
> Can anyone share the best way to implement this in Spark?
>
> Regards,
> Rabin Banerjee
>
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> Just realized you only want to keep first element. You can do this
>> without sorting by doing something similar to min or max operation using a
>> custom aggregator/udaf or reduceGroups on Dataset. This is also more
>> efficient.
>>
>> On Nov 3, 2016 7:53 AM, "Rabin Banerjee" <dev.rabin.banerjee@gmail.com>
>> wrote:
>>
>>> Hi All ,
>>>
>>>   I want to do a dataframe operation to find the rows having the latest
>>> timestamp in each group using the below operation
>>>
>>> df.orderBy(desc("transaction_date"))
>>>   .groupBy("mobileno")
>>>   .agg(first("customername").as("customername"),
>>>        first("service_type").as("service_type"),
>>>        first("cust_addr").as("cust_addr"))
>>>   .select("customername","service_type","mobileno","cust_addr")
>>>
>>>
>>> *Spark Version :: 1.6.x*
>>>
>>> My question is: *"Will Spark guarantee the order while doing the groupBy, if
>>> the DF was previously ordered using orderBy, in Spark 1.6.x?"*
>>>
>>>
>>> *I referred to a blog here:*
>>> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>>>
>>> *It claims this will work except in Spark 1.5.1 and 1.5.2.*
>>>
>>>
>>> *I need a bit of elaboration on how Spark handles this internally. Also, is it
>>> more efficient than using a Window function?*
>>>
>>>
>>> *Thanks in advance,*
>>>
>>> *Rabin Banerjee*
