spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Pairwise Processing of a List
Date Mon, 26 Jan 2015 01:21:49 GMT
If this is really about just Scala Lists, then a simple answer (using
tuples of doubles) is:

val points: List[(Double,Double)] = ...
val distances = for (p1 <- points; p2 <- points) yield {
  val dx = p1._1 - p2._1
  val dy = p1._2 - p2._2
  math.sqrt(dx*dx + dy*dy)
}
distances.sum / 2

It's "/ 2" since this counts every pair twice. You could double the
speed of that, with a slightly more complex formulation using indices,
that avoids comparing points to themselves and makes each comparison
just once.
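
A minimal sketch of that index-based formulation, in case it helps. The
function name sumPairwiseDistances is only an illustrative choice, and the
List is copied to a Vector on the assumption that cheap indexed access is
wanted:

```scala
// Sum of distances over all unordered pairs (i, j) with i < j.
// sumPairwiseDistances is a hypothetical name used for illustration.
def sumPairwiseDistances(points: List[(Double, Double)]): Double = {
  val pts = points.toVector // List indexing is O(n); Vector indexing is effectively constant
  val n = pts.length
  val distances = for {
    i <- pts.indices.iterator      // iterator avoids materializing all n*(n-1)/2 distances
    j <- (i + 1 until n).iterator  // start at i + 1: no self-pairs, each pair seen once
  } yield {
    val dx = pts(i)._1 - pts(j)._1
    val dy = pts(i)._2 - pts(j)._2
    math.sqrt(dx * dx + dy * dy)
  }
  distances.sum // no "/ 2" needed, since each pair is counted once
}
```

For the points (0,0), (3,4) and (0,4) the three distances are 5, 4 and 3,
so the result is 12.0.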

If you really need the sum of all pairwise distances, I don't think
you can do better than that (modulo dealing with duplicates
intelligently).

If we're talking RDDs, then the simple answer is similar:

val pointsRDD: RDD[(Double,Double)] = ...
val distancesRDD = pointsRDD.cartesian(pointsRDD).map { case (p1, p2) => ... }
distancesRDD.sum / 2

It takes more work to make the same optimization, and involves
zipWithIndex, but is possible.
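
One way this could look, as a sketch only. The name
sumPairwiseDistancesRDD is made up for illustration. Note that cartesian
still generates every ordered pair; the filter on indices only avoids
computing each distance twice, so the gain may be well short of 2x:

```scala
import org.apache.spark.rdd.RDD
// Assumption: on Spark releases before 1.3 you would also need
// `import org.apache.spark.SparkContext._` for the implicit sum() on RDD[Double].

// sumPairwiseDistancesRDD is a hypothetical name used for illustration.
def sumPairwiseDistancesRDD(pointsRDD: RDD[(Double, Double)]): Double = {
  // Attach a unique Long index to every point.
  val indexed = pointsRDD.zipWithIndex()
  indexed.cartesian(indexed)
    // Keep each unordered pair once and drop self-pairs.
    .filter { case ((_, i), (_, j)) => i < j }
    .map { case ((p1, _), (p2, _)) =>
      val dx = p1._1 - p2._1
      val dy = p1._2 - p2._2
      math.sqrt(dx * dx + dy * dy)
    }
    .sum() // no "/ 2" needed, since each pair is counted once
}
```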

If the reason we're talking about Lists is that the set of points is
still fairly small, but big enough that all-pairs deserves distributed
computation, then I'd parallelize the List into an RDD, and also
broadcast it, and then implement a hybrid of these two approaches.
You'd have the outer loop over points happening in parallel via the
RDD, and inner loop happening locally over the local broadcasted copy
in memory.
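
A rough sketch of that hybrid, assuming the whole point set fits
comfortably in memory on every executor. The name
sumPairwiseDistancesBroadcast and the choice to pass the SparkContext as a
parameter are illustrative only:

```scala
import org.apache.spark.SparkContext
// Assumption: on Spark releases before 1.3 you would also need
// `import org.apache.spark.SparkContext._` for the implicit sum() on RDD[Double].

// sumPairwiseDistancesBroadcast is a hypothetical name used for illustration.
def sumPairwiseDistancesBroadcast(sc: SparkContext,
                                  points: List[(Double, Double)]): Double = {
  val pts = points.toVector
  val local = sc.broadcast(pts) // one read-only copy per executor

  // Outer loop: distributed over the RDD of (point, index).
  sc.parallelize(pts.zipWithIndex).map { case ((x1, y1), i) =>
    // Inner loop: runs locally over the broadcast copy, and only over
    // points after index i, so each pair is handled once.
    local.value.iterator.drop(i + 1).map { case (x2, y2) =>
      val dx = x1 - x2
      val dy = y1 - y2
      math.sqrt(dx * dx + dy * dy)
    }.sum
  }.sum()
}
```

Points with low indices do more inner-loop work than points with high
indices, so the partitions will not be evenly loaded.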

... and if the use case isn't really to find all-pairs distances and
their sum, there may be faster ways still to do what you need.

On Mon, Jan 26, 2015 at 12:32 AM, Steve Nunez <snunez@hortonworks.com> wrote:
> Spark Experts,
>
> I’ve got a list of points, List[(Float, Float)], that represent (x,y)
> coordinate pairs, and I need to sum the distances between them. It’s easy
> enough to compute the distance:
>
> import scala.math.{pow, sqrt}
>
> case class Point(x: Float, y: Float) {
>   def distance(other: Point): Float =
>     sqrt(pow(x - other.x, 2) + pow(y - other.y, 2)).toFloat
> }
>
> (in this case I create a ‘Point’ class, but the maths are the same).
>
> What I can’t figure out is the ‘right’ way to sum distances between all the
> points. I can make this work by traversing the list with a for loop and
> using indices, but this doesn’t seem right.
>
> Anyone know a clever way to process List[(Float, Float)] in a pairwise
> fashion?
>
> Regards,
> - Steve
>
>


