spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <>
Subject Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY
Date Fri, 04 Nov 2016 00:59:13 GMT
> It is still unclear to me why we should remember all these tricks (or add
> lots of extra little functions) when this elegantly can be expressed in a
> reduce operation with a simple one line lamba function.
I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
function.  This probably won't be as fast though because you end up
creating objects where as the version I gave will get codgened to operate
on binary data the whole way though.

> The same applies to these Window functions. I had to read it 3 times to
> understand what it all means. Maybe it makes sense for someone who has been
> forced to use such limited tools in sql for many years but that's not
> necessary what we should aim for. Why can I not just have the sortBy and
> then an Iterator[X] => Iterator[Y] to express what I want to do?
We also have orderBy and mapPartitions.

> All these functions (rank etc.) can be trivially expressed in this, plus I
> can add other operations if needed, instead of being locked in like this
> Window framework.
 I agree that window functions would probably not be my first choice for
many problems, but for people coming from SQL it was a very popular
feature.  My real goal is to give as many paradigms as possible in a single
unified framework.  Let people pick the right mode of expression for any
given job :)

View raw message