spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <ko...@tresata.com>
Subject Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY
Date Fri, 04 Nov 2016 13:18:09 GMT
sure, but then my values are not sorted per key, right?

so a group by key with values sorted according to to some ordering is an
operation that can be done efficiently in a single shuffle without first
figuring out range boundaries. and it is needed for quite a few algos,
including Window and lots of timeseries stuff. but it seems there is no way
to express i want to do this yet (at least not in an efficient way).

which makes me wonder, what does Window do?


On Fri, Nov 4, 2016 at 12:59 AM, Michael Armbrust <michael@databricks.com>
wrote:

> Thinking out loud is good :)
>
> You are right in that anytime you ask for a global ordering from Spark you
> will pay the cost of figuring out the range boundaries for partitions.  If
> you say orderBy, though, we aren't sure that you aren't expecting a
> global order.
>
> If you only want to make sure that items are colocated, it is cheaper to
> do a groupByKey followed by a flatMapGroups
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1828840559545742/2840265927289860/latest.html>
> .
>
>
>
> On Thu, Nov 3, 2016 at 7:31 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> i guess i could sort by (hashcode(key), key, secondarySortColumn) and
>> then do mapPartitions?
>>
>> sorry thinking out loud a bit here. ok i think that could work. thanks
>>
>> On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> thats an interesting thought about orderBy and mapPartitions. i guess i
>>> could emulate a groupBy with secondary sort using those two. however isn't
>>> using an orderBy expensive since it is a total sort? i mean a groupBy with
>>> secondary sort is also a total sort under the hood, but its on
>>> (hashCode(key), secondarySortColumn) which is easier to distribute and
>>> therefore can be implemented more efficiently.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust <michael@databricks.com
>>> > wrote:
>>>
>>>> It is still unclear to me why we should remember all these tricks (or
>>>>> add lots of extra little functions) when this elegantly can be expressed
in
>>>>> a reduce operation with a simple one line lamba function.
>>>>>
>>>> I think you can do that too.  KeyValueGroupedDataset has a
>>>> reduceGroups function.  This probably won't be as fast though because you
>>>> end up creating objects where as the version I gave will get codgened to
>>>> operate on binary data the whole way though.
>>>>
>>>>> The same applies to these Window functions. I had to read it 3 times
>>>>> to understand what it all means. Maybe it makes sense for someone who
has
>>>>> been forced to use such limited tools in sql for many years but that's
not
>>>>> necessary what we should aim for. Why can I not just have the sortBy
and
>>>>> then an Iterator[X] => Iterator[Y] to express what I want to do?
>>>>>
>>>> We also have orderBy and mapPartitions.
>>>>
>>>>> All these functions (rank etc.) can be trivially expressed in this,
>>>>> plus I can add other operations if needed, instead of being locked in
like
>>>>> this Window framework.
>>>>>
>>>>  I agree that window functions would probably not be my first choice
>>>> for many problems, but for people coming from SQL it was a very popular
>>>> feature.  My real goal is to give as many paradigms as possible in a single
>>>> unified framework.  Let people pick the right mode of expression for any
>>>> given job :)
>>>>
>>>
>>>
>>
>

Mime
View raw message