Hi all,

I have a distribution represented as an RDD of (segment, score) tuples.
For each segment, I want to discard the tuples whose scores fall in the top X percent for that segment. This seems hard to do with Spark RDDs.

A naive algorithm would be:

1) Sort the RDD by segment & score (ascending)
2) Within each segment, number the rows from 1 upward.
3) For each segment, calculate the cut-off index, i.e. 90 for a 10% cut-off on a segment with 100 rows.
4) For the entire RDD, keep only rows with row number <= cut-off index (the rows above it are the top X percent).
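For what it's worth, the per-segment logic of those steps could be sketched in plain Python like this (the function name is made up, and `itertools.groupby` here stands in for whatever per-segment grouping you would do in Spark, e.g. something along the lines of `groupByKey`):

```python
import math
from itertools import groupby
from operator import itemgetter

def discard_top_percent(rows, pct):
    """Within each segment, drop the tuples whose score is in the top pct%.

    rows: iterable of (segment, score) tuples; pct: e.g. 10 for 10%.
    Mirrors the steps above: sort by segment and ascending score,
    number the rows within each segment, compute a cut-off index,
    and keep only the rows at or below it.
    """
    kept = []
    ordered = sorted(rows, key=itemgetter(0, 1))          # step 1: sort by segment & score
    for seg, grp in groupby(ordered, key=itemgetter(0)):  # step 2: per-segment rows
        seg_rows = list(grp)
        n = len(seg_rows)
        cutoff = n - math.ceil(n * pct / 100.0)           # step 3: cut-off index
        kept.extend(seg_rows[:cutoff])                    # step 4: keep rows <= cut-off
    return kept
```

For example, with 10 rows in a segment and pct=10, the cut-off is 9, so the single highest-scoring row is discarded.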

This does not look like an efficient algorithm, since it requires a global sort and per-segment row numbering. I would really appreciate it if someone could suggest a better way to implement this in Spark.