Yes, this is possible. Use a cartesian product to generate all pairs of
points for each IP address, then run two stages of map/reduce:
 first, keyed by pair of points, to total the occurrences of each pair, and
 second, keyed by IP address, to find the pair with the maximum count for
each IP address.
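
A rough sketch of those two stages, written in plain Python on a single
machine rather than against the Spark API, so it only illustrates the
keying. The names haversine_m and best_pair_per_ip are made up for this
example, the haversine formula is my assumption for the distance measure,
and I use unordered combinations instead of a full cartesian product so
that each pair is counted once:

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; an assumption for this sketch


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def best_pair_per_ip(records, threshold_m=100.0):
    """records: iterable of (ip, lat, lon). Returns {ip: (pair, count)}."""
    # Collect all coordinates for each IP address.
    by_ip = defaultdict(list)
    for ip, lat, lon in records:
        by_ip[ip].append((lat, lon))

    # Stage 1: key = (ip, pair of points), value = number of occurrences.
    # Only pairs closer than the threshold are kept.
    pair_counts = Counter()
    for ip, points in by_ip.items():
        for p, q in combinations(points, 2):
            if haversine_m(p, q) <= threshold_m:
                pair_counts[(ip, tuple(sorted((p, q))))] += 1

    # Stage 2: key = ip, keep the pair with the maximum count.
    best = {}
    for (ip, pair), n in pair_counts.items():
        if ip not in best or n > best[ip][1]:
            best[ip] = (pair, n)
    return best
```

In Spark, stage 1 would roughly correspond to a reduce keyed by
(ip, pair) and stage 2 to a reduce keyed by ip. An IP address with only
one record produces no pairs and so does not appear in the result.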
Oleg
On 4 June 2014 11:49, lmk <lakshmi.muralikrishnan@gmail.com> wrote:
> Hi,
> I am a new Spark user. Please let me know how to handle the following
> scenario:
>
> I have a data set with the following fields:
> 1. DeviceId
> 2. latitude
> 3. longitude
> 4. ip address
> 5. Datetime
> 6. Mobile application name
>
> With the above data, I would like to perform the following steps:
> 1. Collect all lat and lon for each ipaddress
> (ip1,(lat1,lon1),(lat2,lon2))
> (ip2,(lat3,lon3),(lat4,lon4))
> 2. For each IP,
>    1. Find the distance between each lat and lon coordinate pair and all
>       the other pairs under the same IP
>    2. Select those coordinates whose distances fall under a specific
>       threshold (say 100m)
>    3. Find the coordinate pair with the maximum occurrences
>
> In this case, how can I iterate and compare each coordinate pair with all
> the other pairs?
> Can this be done in a distributed manner, as this data set is going to have
> a few million records?
> Can we do this in map/reduce commands?
>
> Thanks.

