spark-dev mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Implementing rdd.scanLeft()
Date Thu, 05 Jun 2014 17:29:17 GMT
I have a use case that would greatly benefit from RDDs having a .scanLeft()
method.  Are the project developers interested in adding this to the public
API?


Looking through past message traffic, this has come up a few times.  The
list's earlier recommendation has been to implement a parallel prefix scan.

http://comments.gmane.org/gmane.comp.lang.scala.spark.user/1880
https://groups.google.com/forum/#!topic/spark-users/ts-FdB50ltY

The algorithm Reynold sketched in the first link leads to this working
implementation:

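val vector = sc.parallelize(1 to 20, 3)

// Pass 1: sum each partition, collect the totals to the driver, and turn
// them into offsets. sums(i) is the total of all partitions before i.
val sums = vector.mapPartitions(iter => Iterator(iter.sum)).collect.scanLeft(0)(_+_)

// Pass 2: scan each partition, seeded with that partition's offset.
val prefixScan = vector.mapPartitionsWithIndex { case (partition, iter) =>
  val base = sums(partition)
  println(s"partition $partition, base $base")  // debug output
  iter.scanLeft(base)(_+_).drop(1)  // drop the seed: one output per input
}.collect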
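
The two-pass scheme above can be checked without a cluster. The following is a minimal sketch in plain Scala, assuming each inner Seq stands in for one RDD partition; the names PrefixScanModel and chunkedScan are hypothetical and exist only for this illustration. It compares the chunked result against a sequential scanLeft over the same data.

```scala
object PrefixScanModel {
  // Two-pass prefix sum over data that is already split into chunks.
  // Each inner Seq models one partition (an assumption of this sketch).
  def chunkedScan(chunks: Seq[Seq[Int]]): Seq[Int] = {
    // Pass 1: one total per chunk, then running offsets.
    // offsets(i) is the sum of every element in the chunks before chunk i.
    val offsets = chunks.map(_.sum).scanLeft(0)(_ + _)

    // Pass 2: scan each chunk from its offset; drop(1) removes the seed.
    chunks.zipWithIndex.flatMap { case (chunk, i) =>
      chunk.scanLeft(offsets(i))(_ + _).drop(1)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 20).toVector
    // The chunk boundaries are arbitrary; the result should not depend on them.
    val chunks = Seq(data.slice(0, 6), data.slice(6, 13), data.slice(13, 20))
    val sequential = data.scanLeft(0)(_ + _).drop(1)
    assert(chunkedScan(chunks) == sequential)
    println(chunkedScan(chunks).mkString(" "))
  }
}
```

The chunked and sequential results agree here because + is associative and 0 is its identity. An operator without those properties would generally give results that depend on where the chunk boundaries fall.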


I'd love to have that replaced with this:

val vector = sc.parallelize(1 to 20, 3)
val cumSum: RDD[Int] = vector.scanLeft(0)(_+_)
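
One possible shape for such a method, as a sketch only: the names ScanLeftSketch, ScanLeftOps and scanLeftSketch are hypothetical and not part of Spark's API. The sketch assumes op is associative and zero is its identity, and, like the implementation above, it returns one value per input element rather than including the leading seed that Scala's own scanLeft produces.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration; not part of Spark's API.
object ScanLeftSketch {
  implicit class ScanLeftOps[T: ClassTag](rdd: RDD[T]) {
    // Assumes `op` is associative and `zero` is its identity element.
    // Unlike Seq.scanLeft, the result omits the leading seed value.
    def scanLeftSketch(zero: T)(op: (T, T) => T): RDD[T] = {
      // Job 1 (eager): reduce each partition to one value, collect to the driver.
      val totals = rdd.mapPartitions(iter => Iterator(iter.fold(zero)(op))).collect()
      // offsets(i) combines the totals of all partitions before partition i.
      val offsets = totals.scanLeft(zero)(op)
      // Job 2 (lazy): scan each partition, seeded with its offset.
      rdd.mapPartitionsWithIndex { (index, iter) =>
        iter.scanLeft(offsets(index))(op).drop(1)
      }
    }
  }
}
```

With import ScanLeftSketch._ in scope, the call would read vector.scanLeftSketch(0)(_ + _). Two things worth noting about this approach: the input RDD is evaluated twice, so caching it first may help, and the collect inside the method runs a job immediately, which is unusual for something that otherwise looks like a lazy transformation.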


Any thoughts on whether this contribution would be accepted?  What pitfalls
exist that I should be thinking about?

Thanks!
Andrew
