spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Can this be done in map-reduce technique (in parallel)
Date Thu, 05 Jun 2014 06:04:17 GMT
Hi lmk,

I think iterators and a for-comprehension may help here. I wrote a snippet
that implements your first two requirements:

    def distance(a: (Double, Double), b: (Double, Double)): Double = ???

    // Defines some total ordering among locations.
    def lessThan(a: (Double, Double), b: (Double, Double)): Boolean = ???

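    sc.textFile("input")
      .map { line =>
        // Assumes exactly six comma-separated fields per line;
        // any other field count throws a MatchError.
        val Array(_, latitude, longitude, ip, _, _) = line.split(",")
        ip -> (latitude.toDouble, longitude.toDouble)
      }
      .groupByKey()
      .mapValues { positions =>
        // Lazily enumerate each unordered pair once (lessThan drops
        // mirrored and self pairs) and keep pairs closer than 100.
        for {
          a <- positions.iterator
          b <- positions.iterator
          if lessThan(a, b) && distance(a, b) < 100
        } yield {
          (a, b)
        }
      }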
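
The two `???` placeholders above are left open on purpose. One possible way to fill them in, purely as a sketch: the haversine great-circle formula for `distance`, and a lexicographic comparison for `lessThan`. The unit (kilometres), the Earth radius constant, and the `GeoHelpers` object name are assumptions of this sketch, not something stated in the thread, so the threshold of 100 would then mean 100 km.

```scala
object GeoHelpers {
  // Assumption: distances are wanted in kilometres (mean Earth radius).
  val EarthRadiusKm = 6371.0

  // Haversine great-circle distance between two (latitude, longitude)
  // points given in degrees.
  def distance(a: (Double, Double), b: (Double, Double)): Double = {
    val (lat1, lon1) = a
    val (lat2, lon2) = b
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    // Clamp h to 1.0 to guard against rounding before taking asin.
    2 * EarthRadiusKm * math.asin(math.sqrt(math.min(1.0, h)))
  }

  // Lexicographic order on (latitude, longitude). Identical points are
  // never "less than" each other, so self pairs are filtered out.
  def lessThan(a: (Double, Double), b: (Double, Double)): Boolean =
    a._1 < b._1 || (a._1 == b._1 && a._2 < b._2)
}
```

Any strict total order would do for `lessThan`; its only job is to make sure that of `(a, b)` and `(b, a)` exactly one survives the filter.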

The key point is that iterators are lazily evaluated, so you don’t need
to store the whole Cartesian product.
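
To make the laziness concrete, here is a small Spark-free sketch that uses plain integers instead of coordinates. The `LazyPairs` object, the `closePairs` function, and the `tested` counter are hypothetical names introduced only for this illustration. The counter records how many candidate pairs have been examined, which shows that building the iterator does no work until someone consumes it.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyPairs {
  // Returns a lazy iterator over pairs (a, b) with a < b and b - a < maxGap.
  // `tested` is incremented once per candidate pair that is examined.
  def closePairs(xs: Seq[Int], maxGap: Int, tested: AtomicInteger): Iterator[(Int, Int)] = {
    def isClose(a: Int, b: Int): Boolean = {
      tested.incrementAndGet()
      a < b && b - a < maxGap
    }
    for {
      a <- xs.iterator
      b <- xs.iterator
      if isClose(a, b)
    } yield (a, b)
  }

  def main(args: Array[String]): Unit = {
    val tested = new AtomicInteger(0)
    val pairs = closePairs(Seq(1, 2, 3, 10), 3, tested)
    println(tested.get)   // 0: no pair has been tested yet
    println(pairs.next()) // pulls only as many candidates as needed
    println(tested.get)   // far fewer than the 16 possible candidates
  }
}
```

Note that this only avoids materialising the pairs. In the Spark snippet, `groupByKey()` still has to hold all positions of a single IP address on one executor.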

I didn’t quite get your third requirement, but I think you can implement it
by following a similar approach.

Cheng


On Thu, Jun 5, 2014 at 1:11 PM, lmk <lakshmi.muralikrishnan@gmail.com>
wrote:

> Hi Oleg/Andrew,
> Thanks much for the prompt response.
>
> We expect thousands of lat/lon pairs for each IP address. And that is my
> concern with the Cartesian product approach.
> Currently for a small sample of this data (5000 rows) I am grouping by IP
> address and then computing the distance between lat/lon coordinates using
> array manipulation techniques.
> But I understand this approach is not right when the data volume goes up.
> My code is as follows:
>
> val dataset:RDD[String] = sc.textFile("x.csv")
> val data = dataset.map(l=>l.split(","))
> val grpData = data.map(r =>
> (r(3),((r(1).toDouble),r(2).toDouble))).groupByKey()
>
> Now I have the data grouped by IP address as Array[(String,
> Iterable[(Double, Double)])]
> e.g.
>  Array((ip1,ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3))))
>
> Now I have to find the distance between (lat1,lon1) and (lat2,lon2), then
> between (lat1,lon1) and (lat3,lon3), and so on for all combinations.
>
> This is where I get stuck. Please guide me on this.
>
> Thanks Again.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
