spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <>
Subject Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY
Date Fri, 04 Nov 2016 02:31:09 GMT
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then
do mapPartitions?

sorry thinking out loud a bit here. ok i think that could work. thanks

On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers <> wrote:

> thats an interesting thought about orderBy and mapPartitions. i guess i
> could emulate a groupBy with secondary sort using those two. however isn't
> using an orderBy expensive since it is a total sort? i mean a groupBy with
> secondary sort is also a total sort under the hood, but its on
> (hashCode(key), secondarySortColumn) which is easier to distribute and
> therefore can be implemented more efficiently.
> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust <>
> wrote:
>> It is still unclear to me why we should remember all these tricks (or add
>>> lots of extra little functions) when this elegantly can be expressed in a
>>> reduce operation with a simple one line lamba function.
>> I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
>> function.  This probably won't be as fast though because you end up
>> creating objects where as the version I gave will get codgened to operate
>> on binary data the whole way though.
>>> The same applies to these Window functions. I had to read it 3 times to
>>> understand what it all means. Maybe it makes sense for someone who has been
>>> forced to use such limited tools in sql for many years but that's not
>>> necessary what we should aim for. Why can I not just have the sortBy and
>>> then an Iterator[X] => Iterator[Y] to express what I want to do?
>> We also have orderBy and mapPartitions.
>>> All these functions (rank etc.) can be trivially expressed in this, plus
>>> I can add other operations if needed, instead of being locked in like this
>>> Window framework.
>>  I agree that window functions would probably not be my first choice for
>> many problems, but for people coming from SQL it was a very popular
>> feature.  My real goal is to give as many paradigms as possible in a single
>> unified framework.  Let people pick the right mode of expression for any
>> given job :)

View raw message