spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY
Date Thu, 03 Nov 2016 23:46:46 GMT
Oh okay, that makes sense. The trick is to take the max on a Tuple2 so you
carry the other column along.
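
For concreteness, a minimal sketch of that Tuple2 trick on a plain RDD (the
schema and values are invented for illustration, and a SparkContext named sc
is assumed):

// (mobileno, (timestamp, customername)) -- hypothetical shape
val rows = sc.parallelize(Seq(
  ("111", (20L, "new name")),
  ("111", (10L, "old name")),
  ("222", (5L,  "only name"))))

// Tuple2 orders lexicographically, so taking the max by the first
// element (the timestamp) carries the second column along for free.
val latest = rows.reduceByKey((a, b) => Ordering[(Long, String)].max(a, b))
// latest contains ("111", (20L, "new name")) and ("222", (5L, "only name"))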

It is still unclear to me why we should have to remember all these tricks (or
add lots of extra little functions) when this can be expressed elegantly in a
reduce operation with a simple one-line lambda function.
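
And a sketch of the one-line-lambda reduce being contrasted with it, on a
typed Dataset (Spark 2.x names; the Txn case class and a Dataset named txns
are invented for illustration):

case class Txn(mobileno: String, transaction_date: Long, customername: String)

import spark.implicits._           // assumes a SparkSession named spark

val latestPerPhone = txns          // txns: Dataset[Txn]
  .groupByKey(_.mobileno)
  .reduceGroups((a, b) => if (a.transaction_date >= b.transaction_date) a else b)
  .map(_._2)                       // drop the grouping key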

The same applies to these Window functions. I had to read it three times to
understand what it all means. Maybe it makes sense for someone who has been
forced to use such limited tools in SQL for many years, but that's not
necessarily what we should aim for. Why can I not just have the sortBy and
then an Iterator[X] => Iterator[Y] to express what I want to do? All these
functions (rank etc.) can be trivially expressed that way, plus I can add
other operations if needed, instead of being locked into this Window
framework.
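
A sketch of what that Iterator-to-Iterator style can look like today via
flatMapGroups, expressing rank by hand (reusing the hypothetical Txn dataset
above; the in-memory sort per group stands in for the sortBy, since
flatMapGroups itself makes no ordering promise):

val ranked = txns
  .groupByKey(_.mobileno)
  .flatMapGroups { (mobileno, it) =>
    // Iterator[Txn] in, Iterator[(Txn, Int)] out: rank rows by recency.
    it.toSeq
      .sortBy(-_.transaction_date)
      .zipWithIndex
      .map { case (t, i) => (t, i + 1) }
      .iterator
  }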

On Nov 3, 2016 4:10 PM, "Michael Armbrust" <michael@databricks.com> wrote:

You are looking to perform an *argmax*, which you can do with a single
aggregation. Here is an example:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html
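
The notebook behind that link is not reproduced in the archive; a plausible
sketch of the single-aggregation argmax, using max over a struct (structs
compare field by field, so leading with the timestamp makes max pick the
latest row and carry the other columns along):

import org.apache.spark.sql.functions._

val latest = df
  .groupBy("mobileno")
  .agg(max(struct("transaction_date", "customername",
                  "service_type", "cust_addr")).as("latest"))
  .select("mobileno", "latest.customername",
          "latest.service_type", "latest.cust_addr")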

On Thu, Nov 3, 2016 at 4:53 AM, Rabin Banerjee <dev.rabin.banerjee@gmail.com> wrote:

> Hi All,
>
>   I want to do a DataFrame operation to find the rows having the latest
> timestamp in each group, using the operation below:
>
> df.orderBy(desc("transaction_date"))
>   .groupBy("mobileno")
>   .agg(first("customername").as("customername"),
>        first("service_type").as("service_type"),
>        first("cust_addr").as("cust_addr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
>
>
> Spark version: 1.6.x
>
> My question is: will Spark guarantee the order during the groupBy if the
> DataFrame was previously ordered using orderBy, in Spark 1.6.x?
>
>
> I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>
> It claims this will work, except in Spark 1.5.1 and 1.5.2.
>
>
> I need a bit of elaboration on how Spark handles this internally. Also, is
> it more efficient than using a Window function?
>
>
> Thanks in advance,
>
> Rabin Banerjee
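For completeness, a hedged sketch of the Window-function alternative Rabin
asks about, where the ordering is explicit in the window spec rather than
relying on a prior orderBy surviving the groupBy:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

val latestRows = df
  .withColumn("rn", row_number().over(w))   // 1 = most recent per mobileno
  .where(col("rn") === 1)
  .select("customername", "service_type", "mobileno", "cust_addr")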
