spark-user mailing list archives

From Oleg Proudnikov <oleg.proudni...@gmail.com>
Subject Re: Can this be done in map-reduce technique (in parallel)
Date Wed, 04 Jun 2014 12:29:17 GMT
It is possible if you use a cartesian product to generate all possible
pairs of points for each IP address, followed by two stages of map-reduce:
 - first, keyed by pair of points, to find the total count of each pair;
 - second, keyed by IP address, to find the pair with the maximum count
   for each IP address.
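
The two stages described above could be sketched roughly as follows. This
is a plain-Python illustration, not Spark code and not necessarily what was
meant in the reply: the function names (haversine_m, best_pair_per_ip), the
haversine distance, the 100 m default threshold, and the decision to skip
pairs of identical points are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))  # mean Earth radius (assumed)


def best_pair_per_ip(records, threshold_m=100.0):
    """records: iterable of (ip, (lat, lon)). Returns {ip: (pair, count)}."""
    # Collect all points reported under each IP address.
    by_ip = defaultdict(list)
    for ip, point in records:
        by_ip[ip].append(point)

    # Stage 1: key = (ip, pair of points), value = number of occurrences.
    pair_counts = Counter()
    for ip, points in by_ip.items():
        for p, q in combinations(points, 2):  # all pairs within one IP
            # Assumption: identical points are not treated as a pair.
            if p != q and haversine_m(p, q) <= threshold_m:
                pair_counts[(ip, tuple(sorted((p, q))))] += 1

    # Stage 2: key = ip, keep the pair with the maximum count.
    best = {}
    for (ip, pair), count in pair_counts.items():
        if ip not in best or count > best[ip][1]:
            best[ip] = (pair, count)
    return best
```

An IP address with no two distinct points inside the threshold simply does
not appear in the result. On Spark, the same shape would presumably map
onto a per-IP grouping followed by two keyed reductions, which is what
lets the work be distributed.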

Oleg



On 4 June 2014 11:49, lmk <lakshmi.muralikrishnan@gmail.com> wrote:

> Hi,
> I am a new Spark user. Please let me know how to handle the following
> scenario:
>
> I have a data set with the following fields:
> 1. DeviceId
> 2. latitude
> 3. longitude
> 4. ip address
> 5. Datetime
> 6. Mobile application name
>
> With the above data, I would like to perform the following steps:
> 1. Collect all lat and lon for each ipaddress
>         (ip1,(lat1,lon1),(lat2,lon2))
>         (ip2,(lat3,lon3),(lat4,lon4))
> 2. For each IP,
>         1. Find the distance between each lat and lon coordinate pair and
>            all the other pairs under the same IP
>         2. Select those coordinates whose distances fall under a specific
>            threshold (say 100 m)
>         3. Find the coordinate pair with the maximum occurrences
>
> In this case, how can I iterate and compare each coordinate pair with all
> the other pairs?
> Can this be done in a distributed manner, as this data set is going to have
> a few million records?
> Can we do this in map/reduce commands?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



-- 
Kind regards,

Oleg
