spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: Efficient Top count in each window
Date Thu, 12 Mar 2015 18:17:25 GMT
Why are you repartitioning to 1? That would obviously be slow; you are
converting a distributed operation into a single-node operation.
Also consider using RDD.top(). If you define the ordering correctly (based on
the count), then you will get the top K across the RDD without doing the
shuffle that sortByKey requires. Much cheaper.
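The effect of RDD.top with a count-based ordering can be illustrated on plain Scala collections (a sketch with made-up sample data; on an RDD of (value, count) pairs the same Ordering would be passed to rdd.top):

```scala
// Top-K selection by count, mirroring what RDD.top(k)(ord) returns:
// the k largest elements under `ord`, in descending order.
// Sample (value, count) pairs; illustrative data only.
val counts = Seq((42L, 5L), (7L, 9L), (13L, 1L), (99L, 7L))
val byCount = Ordering.by[(Long, Long), Long](_._2)

// Local equivalent of counts.top(2)(byCount) on an RDD of (value, count):
val top2 = counts.sorted(byCount.reverse).take(2)
// top2 == Seq((7L, 9L), (99L, 7L))
```

On an RDD, `rdd.top(k)(byCount)` selects the k largest elements within each partition and merges only those small per-partition results on the driver, so no shuffle is needed. In the streaming job it could be applied per window via something like `counts.transform(rdd => rdd.sparkContext.parallelize(rdd.top(10)(byCount)))` (a sketch, not tested).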

TD



On Thu, Mar 12, 2015 at 11:06 AM, Laeeq Ahmed <laeeqspark@yahoo.com.invalid>
wrote:

> Hi,
>
> I have a streaming application where I am doing a top-10 count in each
> window, which seems slow. Is there an efficient way to do this?
>
>         val counts = keyAndValues.map(x => math.round(x._3.toDouble))
>           .countByValueAndWindow(Seconds(4), Seconds(4))
>         val topCounts = counts.repartition(1).map(_.swap)
>           .transform(rdd => rdd.sortByKey(false))
>           .map(_.swap).mapPartitions(rdd => rdd.take(10))
>
> Regards,
> Laeeq
>
>
