spark-user mailing list archives

From Alex Nastetsky <>
Subject Re: spark sql aggregate function "Nth"
Date Tue, 26 Jul 2016 16:05:31 GMT
Ah, that gives me an idea.

val window = Window.partitionBy(<my grouping>)
val getRand = udf((cnt:Int) => <return random num between 1 and cnt> )

.withColumn("cnt", count(<some col>).over(window))
.withColumn("rnd", getRand($"cnt"))
.where($"rnd" === $"cnt")

Not sure how performant this would be, but writing a UDF is much simpler
than a UDAF.
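
One caveat with the snippet above: each row draws its own random number, so a group can end up with zero matching rows or with several. A minimal sketch of an alternative that avoids both the UDF and the UDAF is to order each group randomly and keep the first row. It assumes Spark 2.0+ (for SparkSession); the object name, the helper pickRandomPerGroup, the temporary column "rn", and the sample columns "grp" and "value" are made up for illustration and are not from this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

object RandomRowPerGroup {

  // Hypothetical helper: keeps exactly one randomly chosen row for each
  // distinct value of groupCol by shuffling rows inside every partition.
  def pickRandomPerGroup(df: DataFrame, groupCol: String): DataFrame = {
    // rand() gives every row a random sort key within its group.
    val w = Window.partitionBy(col(groupCol)).orderBy(rand())
    df.withColumn("rn", row_number().over(w)) // 1, 2, 3, ... per group
      .where(col("rn") === 1)                 // keep the first row of the shuffle
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("random-row-per-group")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: two groups with three and two rows.
    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20))
      .toDF("grp", "value")

    pickRandomPerGroup(df, "grp").show()
    spark.stop()
  }
}
```

Filtering on a different row number (with a deterministic orderBy) would give an "Nth" item instead, though a group with fewer than N rows would then drop out of the result entirely.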

On Tue, Jul 26, 2016 at 11:48 AM, ayan guha <> wrote:

> You can use rank with a window function. Rank=1 is the same as calling first().
> Not sure how you would randomly pick records though, if there is no Nth
> record. In your example, what happens if the data has only 2 rows?
> On 27 Jul 2016 00:57, "Alex Nastetsky" <>
> wrote:
>> Spark SQL has a "first" function that returns the first item in a group.
>> Is there a similar function, perhaps in a third party lib, that allows you
>> to return an arbitrary (e.g. 3rd) item from the group? Was thinking of
>> writing a UDAF for it, but didn't want to reinvent the wheel. My end goal is
>> to be able to select a random item from the group, using a random number
>> generator.
>> Thanks.
