Hi lmk,

I think iterators and for-comprehensions may help here. I wrote a snippet that implements your first 2 requirements:

```scala
// Computes the distance between two (latitude, longitude) pairs.
def distance(a: (Double, Double), b: (Double, Double)): Double = ???

// Defines some total ordering among locations.
def lessThan(a: (Double, Double), b: (Double, Double)): Boolean = ???

sc.textFile("input")
  .map { line =>
    val Array(_, latitude, longitude, ip, _, _) = line.split(",")
    ip -> (latitude.toDouble, longitude.toDouble)
  }
  .groupByKey()
  .mapValues { positions =>
    for {
      a <- positions.iterator
      b <- positions.iterator
      if lessThan(a, b) && distance(a, b) < 100
    } yield (a, b)
  }
```
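The `distance` stub could be filled in with, say, the haversine great-circle formula. A sketch (my addition, not part of the original snippet): it assumes the inputs are (latitude, longitude) in degrees and returns kilometres.

```scala
// Haversine great-circle distance in kilometres between two
// (latitude, longitude) pairs given in degrees.
def distance(a: (Double, Double), b: (Double, Double)): Double = {
  val R = 6371.0 // mean Earth radius, km
  val dLat = math.toRadians(b._1 - a._1)
  val dLon = math.toRadians(b._2 - a._2)
  val h = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(a._1)) * math.cos(math.toRadians(b._1)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(h))
}
```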

The key point is that iterators are lazily evaluated, so you don’t need to store the whole Cartesian product.
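To see the laziness outside of Spark, here is a minimal local sketch of the same pattern (the names `closePairs`, `euclid`, and `lexLt` are mine, just for illustration): the for-comprehension walks the cross product one pair at a time instead of materializing all n*n pairs.

```scala
// Lazily yields all pairs (a, b) with lessThan(a, b) that are closer
// than the threshold. No intermediate collection of pairs is built.
def closePairs(
    positions: Seq[(Double, Double)],
    lessThan: ((Double, Double), (Double, Double)) => Boolean,
    distance: ((Double, Double), (Double, Double)) => Double,
    threshold: Double): Iterator[((Double, Double), (Double, Double))] =
  for {
    a <- positions.iterator
    b <- positions.iterator
    if lessThan(a, b) && distance(a, b) < threshold
  } yield (a, b)

// Plain Euclidean distance and lexicographic ordering, just for the demo.
def euclid(a: (Double, Double), b: (Double, Double)): Double =
  math.hypot(a._1 - b._1, a._2 - b._2)
def lexLt(a: (Double, Double), b: (Double, Double)): Boolean =
  a._1 < b._1 || (a._1 == b._1 && a._2 < b._2)

val pts = Seq((0.0, 0.0), (3.0, 4.0), (100.0, 100.0))
// Only the (0,0)-(3,4) pair is within distance 10; the other candidate
// pairs are examined and discarded one at a time.
val close = closePairs(pts, lexLt, euclid, 10.0).toList
```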

I didn’t quite get your 3rd requirement, but I think you can implement it following a similar approach.

Cheng


On Thu, Jun 5, 2014 at 1:11 PM, lmk wrote:
Hi Oleg/Andrew,
Thanks much for the prompt response.

We expect thousands of lat/lon pairs for each IP address. And that is my concern with the Cartesian product approach.
Currently for a small sample of this data (5000 rows) I am grouping by IP address and then computing the distance between lat/lon coordinates using array manipulation techniques.
But I understand this approach is not right when the data volume goes up. My code is as follows:

val dataset: RDD[String] = sc.textFile("x.csv")
val data = dataset.map(l => l.split(","))
val grpData = data.map(r =>
  (r(3), (r(1).toDouble, r(2).toDouble))).groupByKey()

Now, I have the data grouped by IP address as Array[(String,
Iterable[(Double, Double)])], e.g.:
Array((ip1, ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3))))

Now I have to find the distance between (lat1,lon1) and (lat2,lon2), then
between (lat1,lon1) and (lat3,lon3), and so on for all combinations.

This is where I get stuck. Please guide me on this.

Thanks Again.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
