spark-user mailing list archives

From Tobias Pfeiffer <...@preferred.jp>
Subject Re: *ByKey aggregations: performance + order
Date Wed, 14 Jan 2015 04:53:52 GMT
Hi,

On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer <tgp@preferred.jp> wrote:
>
> Now I don't know (yet) if all of the functions I want to compute can be
> expressed in this way and I was wondering about *how much* more expensive
> we are talking about.
>

OK, it seems that even on a local machine (with no network overhead), the
groupByKey version is roughly five times slower than any of the other
functions (reduceByKey, combineByKey, etc.):

  // 5,000,000 integers, keyed so that every 10 consecutive
  // elements share one key (500,000 keys in total)
  val rdd = sc.parallelize(1 to 5000000)
  val withKeys = rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1))
  // cache and materialize so the timings below don't include input generation
  withKeys.cache()
  withKeys.count()
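
As an aside, the keying itself can be illustrated with plain Scala
collections (no Spark needed): zipWithIndex pairs each element with its
position, and the integer division by 10 puts every 10 consecutive elements
under one key. A toy version with 30 elements instead of 5,000,000:

```scala
// Mirror of rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1)) on a plain Seq.
// Each element is paired with its index; index / 10 buckets every 10
// consecutive elements under the same key.
val data = (1 to 30).zipWithIndex.map { case (v, i) => (i / 10, v) }
val grouped = data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
println(grouped(0))  // key 0 -> the first ten elements, 1 through 10
println(grouped.size)  // 3 keys for 30 elements
```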

  // around 850-1100 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.reduceByKey(_ + _).count()
    System.currentTimeMillis - start
  }

  // around 800-1100 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.combineByKey((x: Int) => x, (x: Int, y: Int) => x + y,
      (x: Int, y: Int) => x + y).count()
    System.currentTimeMillis - start
  }

  // around 1500-1900 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.foldByKey(0)(_ + _).count()
    System.currentTimeMillis - start
  }

  // around 1400-1800 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.aggregateByKey(0)(_ + _, _ + _).count()
    System.currentTimeMillis - start
  }

  // around 5500-6200 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.groupByKey().mapValues(_.reduceLeft(_ + _)).count()
    System.currentTimeMillis - start
  }
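
For what it's worth, the gap matches the usual explanation: reduceByKey,
combineByKey, foldByKey and aggregateByKey all merge values on the map side
before the shuffle, while groupByKey ships every single (key, value) pair
across partitions and only then collects them. A toy sketch with plain Scala
collections (not Spark; the "partitions" are just slices of a Seq) that
counts how many records would cross the shuffle in each case:

```scala
// 100 keyed records split into 4 "partitions" of 25 records each.
val partitions: Seq[Seq[(Long, Int)]] =
  (1 to 100).map(v => ((v / 10).toLong, v)).grouped(25).toSeq

// groupByKey-style: every (key, value) record is shuffled as-is.
val shuffledAsIs = partitions.flatten

// reduceByKey-style: pre-aggregate inside each partition (the map-side
// combine), so only one record per (partition, key) crosses the shuffle.
val preCombined = partitions.flatMap { part =>
  part.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
}

// The reduce side then merges the partial sums per key.
val finalSums = preCombined.groupBy(_._1).map { case (k, kvs) =>
  (k, kvs.map(_._2).sum)
}

println(s"records shuffled without combine: ${shuffledAsIs.size}")
println(s"records shuffled with combine:    ${preCombined.size}")
```

Here only 14 partial sums cross the "shuffle" instead of 100 raw records;
with only 10 values per key (as in the benchmark above) the savings are
smaller than in a typical skewed workload, but groupByKey additionally has
to build an in-memory collection per key, which the combining variants avoid.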

Tobias
