spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aung Htet <aung....@gmail.com>
Subject How to get a top X percent of a distribution represented as RDD
Date Fri, 27 Mar 2015 03:31:21 GMT
Hi all,

I have a distribution represented as an RDD of tuples, in rows of (segment,
score)
For each segment, I want to discard tuples with top X percent scores. This
seems hard to do in Spark RDD.

A naive algorithm would be -

1) Sort RDD by segment & score (descending)
2) Within each segment, number the rows from top to bottom.
3) For each  segment, calculate the cut off index. i.e. 90 for 10% cut off
out of a segment with 100 rows.
4) For the entire RDD, filter rows with row num <= cut off index

This does not look like a good algorithm. I would really appreciate if
someone can suggest a better way to implement this in Spark.

Regards,
Aung

Mime
View raw message