spark-user mailing list archives

From Xuefeng Wu <ben...@gmail.com>
Subject How take top N of top M from RDD as RDD
Date Mon, 01 Dec 2014 14:44:00 GMT
Hi, I have a problem that is easy to solve in plain Scala code, but I cannot
work out how to take the top N from an RDD and keep the result as an RDD.


There are 10000 StudentScore records. I want to take the top 10 ages, and
then take the top 10 from each of those ages, so the result is 100 records.

The Scala code is below, but how can I do this with an RDD?  *RDD.take
returns an Array, not another RDD.*

example Scala code:

import scala.util.Random

case class StudentScore(age: Int, num: Int, score: Int, name: Int)

val scores = for {
  i <- 1 to 10000
} yield {
  StudentScore(Random.nextInt(100), Random.nextInt(100), Random.nextInt(), Random.nextInt())
}


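// Groups scores by the given key, sums the score within each group, and
// returns the 10 groups with the highest sums as (sum, group) pairs.
def takeTop(scores: Seq[StudentScore], byKey: StudentScore => Int): Seq[(Int, Seq[StudentScore])] = {
  // Use .values so groups with equal sums are not collapsed into one Map entry.
  val groupedScore = scores.groupBy(byKey).values
    .map(_scores => (_scores.foldLeft(0)((acc, v) => acc + v.score), _scores))
    .toSeq
  // Sort in descending order so that take(10) yields the top groups.
  groupedScore.sortBy(_._1)(Ordering[Int].reverse).take(10)
}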

val topScores = for {
  (_, ageScores) <- takeTop(scores, _.age)
  (_, numScores) <- takeTop(ageScores, _.num)
} yield {
  numScores
}

topScores.size
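
For comparison, here is a rough sketch of how the same idea might be written
against an RDD so that the result stays an RDD. It is only one possible
approach, not a confirmed answer: the object TopNOfTopM and the helper
topNOfTopM are hypothetical names, "top" is assumed to mean the largest score
totals, and totals are kept as Long to avoid Int overflow. It relies only on
the small driver-side Array returned by RDD.top to pick the best ages, then
filters and joins so that the records themselves never leave the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark versions before 1.3
import org.apache.spark.rdd.RDD

case class StudentScore(age: Int, num: Int, score: Int, name: Int)

object TopNOfTopM {
  // Hypothetical helper: keeps the records of the n (age, num) groups with the
  // largest score totals inside each of the m ages with the largest totals.
  def topNOfTopM(scores: RDD[StudentScore], m: Int, n: Int): RDD[StudentScore] = {
    // Total score per (age, num) pair.
    val totals: RDD[((Int, Int), Long)] =
      scores.map(s => ((s.age, s.num), s.score.toLong)).reduceByKey(_ + _)

    // top() returns an Array on the driver; it holds only m small pairs.
    val topAges: Set[Int] = totals
      .map { case ((age, _), sum) => (age, sum) }
      .reduceByKey(_ + _)
      .top(m)(Ordering.by[(Int, Long), Long](_._2))
      .map(_._1)
      .toSet

    // Within each surviving age, keep the n nums with the largest totals.
    val topKeys: RDD[((Int, Int), Unit)] = totals
      .filter { case ((age, _), _) => topAges.contains(age) }
      .map { case ((age, num), sum) => (age, (num, sum)) }
      .groupByKey()
      .flatMap { case (age, nums) =>
        nums.toSeq.sortBy(_._2)(Ordering[Long].reverse).take(n)
          .map { case (num, _) => ((age, num), ()) }
      }

    // Join back to the original records; the result is still an RDD.
    scores.map(s => ((s.age, s.num), s)).join(topKeys).map { case (_, (s, _)) => s }
  }
}
```

With an existing SparkContext sc, a call such as
TopNOfTopM.topNOfTopM(sc.parallelize(scores), 10, 10) would return an
RDD[StudentScore] containing the records of up to 100 (age, num) groups.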


-- 

~Yours, Xuefeng Wu/吴雪峰  敬上
