spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Dang <nam...@gmail.com>
Subject Re: top-k function for Window
Date Tue, 03 Jan 2017 18:03:20 GMT
> Furthermore, in your example you don’t even need a window function, you
can simply use groupby and explode

Can you clarify? You need to sort somehow (be it map-side sorting or
reduce-side sorting).



-------
Regards,
Andy

On Tue, Jan 3, 2017 at 2:07 PM, Mendelson, Assaf <Assaf.Mendelson@rsa.com>
wrote:

> You can write a UDAF in which the buffer contains the top K and manage it.
> This means you don’t need to sort at all. Furthermore, in your example you
> don’t even need a window function, you can simply use groupby and explode.
>
> Of course, this is only relevant if k is small…
>
>
>
> *From:* Andy Dang [mailto:namd88@gmail.com]
> *Sent:* Tuesday, January 03, 2017 3:07 PM
> *To:* user
> *Subject:* top-k function for Window
>
>
>
> Hi all,
>
>
>
> What's the best way to do top-k with Windowing in Dataset world?
>
>
>
> I have a snippet of code that filters the data to the top-k, but with
> skewed keys:
>
>
>
> val windowSpec = Window.parititionBy(skewedKeys).orderBy(dateTime)
>
> val rank = row_number().over(windowSpec)
>
>
>
> input.withColumn("rank", rank).filter("rank <= 10").drop("rank")
>
>
>
> The problem with this code is that Spark doesn't know that it can sort the
> data locally, get the local rank first. What it ends up doing is performing
> a sort by key using the skewed keys, and this blew up the cluster since the
> keys are heavily skewed.
>
>
>
> In the RDD world we can do something like:
>
> rdd.mapPartitioins(iterator -> topK(iterator))
>
> but I can't really think of an obvious to do this in the Dataset API,
> especially with Window function. I guess some UserAggregateFunction would
> do, but I wonder if there's obvious way that I missed.
>
>
>
> -------
> Regards,
> Andy
>

Mime
View raw message