spark-user mailing list archives

From manasdebashiskar <>
Subject Re: How to create Track per vehicle using spark RDD
Date Wed, 15 Oct 2014 13:12:28 GMT
It is wonderful to see some ideas.
Now the questions:
1) What is a track segment?
 Ans) It is the line that connects two adjacent points when all points are
arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3).
Then the segments are (p1, p2) and (p2, p3), given the time ordering
t1 < t2 < t3.
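For illustration, here is a plain-Python sketch (outside Spark) of building
segments this way: sort one vehicle's points by time, then pair each point
with its successor, which is exactly the effect a lag would give. The
function and data names are hypothetical.

```python
def track_segments(points):
    """points: list of (timestamp, position) tuples for one vehicle."""
    ordered = sorted(points)                 # arrange all points by time
    positions = [p for _, p in ordered]
    # pair each position with the next one: the "lag" step
    return list(zip(positions, positions[1:]))

pts = [(3, "p3"), (1, "p1"), (2, "p2")]
track_segments(pts)  # -> [("p1", "p2"), ("p2", "p3")]
```

In Spark the same idea would apply per vehicle after a groupBy or sort,
rather than on a plain list.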
2) What is the Lag function?
Ans) Sean's link explains it.

A little bit more about my requirement:
 What I need to calculate is a density map of vehicles in a certain area.
Because of a user-specific requirement I can't use just points; I will
have to use segments.
 I already have a gridRDD containing 1 km polygons for the whole world.
My approach is:
1) create a tracksegmentRDD of (vehicle, segment) pairs
2) do a cartesian of tracksegmentRDD and gridRDD, and for each row check
whether the segment intersects the polygon. If it does, count it as 1.
3) group the result above by vehicle (probably reduceByKey(_ + _)) to get
the density map
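The three steps can be sketched in plain Python as a stand-in for the RDD
operations: in a real job the nested loop below would be RDD.cartesian plus
reduceByKey, and a proper geometry library would do the segment/polygon
intersection. Here a simple bounding-box overlap check and toy data stand in
for both; all names are hypothetical.

```python
from itertools import product
from collections import defaultdict

def boxes_overlap(seg, cell):
    """Crude stand-in for a real segment/polygon intersection test."""
    (x1, y1), (x2, y2) = seg
    cx1, cy1, cx2, cy2 = cell          # grid cell as (minx, miny, maxx, maxy)
    return (min(x1, x2) <= cx2 and max(x1, x2) >= cx1 and
            min(y1, y2) <= cy2 and max(y1, y2) >= cy1)

segments = [("v1", ((0.1, 0.1), (0.9, 0.9))),   # tracksegmentRDD stand-in
            ("v1", ((5.0, 5.0), (6.0, 6.0)))]
grid = [(0, 0, 1, 1), (1, 1, 2, 2)]             # gridRDD stand-in

counts = defaultdict(int)
for (vehicle, seg), cell in product(segments, grid):  # step 2: the cartesian
    if boxes_overlap(seg, cell):
        counts[vehicle] += 1                          # step 3: reduceByKey(_ + _)
```

Note that a full cartesian of a world-wide grid against all segments is
expensive; in practice one would want to key segments to nearby grid cells
first to avoid comparing every pair.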

I am checking an issue
which seems to have some potential. I will give it a try.


On Wed, Oct 15, 2014 at 2:55 AM, sowen [via Apache Spark User List] <> wrote:

> You say you reduceByKey but are you really collecting all the tuples
> for a vehicle in a collection, like what groupByKey does already? Yes,
> if one vehicle has a huge amount of data that could fail.
> Otherwise perhaps you are simply not increasing memory from the default.
> Maybe you can consider using something like vehicle and *day* as the
> key. This would make you process each day of data separately, but if
> that's fine for you, it might drastically cut down the data associated
> with a single key.
> Spark Streaming has a windowing function, and there is a window
> function for an entire RDD, but I am not sure if there is support for
> a 'window by key' anywhere. You can perhaps get your direct approach
> of collecting events working with some of the changes above.
> Otherwise I think you have to roll your own to some extent, creating
> the overlapping buckets of data, which will mean mapping the data to
> several copies of itself. This might still be quite feasible depending
> on how big a lag you are thinking of.
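The "overlapping buckets" idea above can be sketched in plain Python:
duplicate each record into every window that contains it, so each
(vehicle, window) key can then be processed independently with an ordinary
reduce or group. The window length and slide are assumed parameters.

```python
def window_keys(t, length=10, slide=5):
    """All window start times k*slide whose window [k*slide, k*slide+length) contains t."""
    first = (t - length) // slide + 1
    last = t // slide
    return [k * slide for k in range(first, last + 1) if k >= 0]

records = [("v1", 3), ("v1", 7), ("v1", 12)]     # (vehicle, timestamp)

# map each record to several copies of itself, one per overlapping window
buckets = [((v, w), t) for v, t in records for w in window_keys(t)]
```

In Spark this would be a flatMap producing ((vehicle, window), record)
pairs, at the cost of roughly length/slide copies of each record.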
> PS for the interested, this is what LAG is:
> On Wed, Oct 15, 2014 at 1:37 AM, Manas Kar <[hidden email]> wrote:
> > Hi,
> >  I have an RDD containing Vehicle Number, timestamp, Position.
> >  I want to get the "lag" function equivalent for my RDD to be able to
> > create track segments for each Vehicle.
> >
> > Any help?
> >
> > PS: I have tried reduceByKey and then splitting the List of positions
> > into tuples. For me it runs out of memory every time because of the
> > volume of data.
> >
> > ...Manas
> >
> > For some reason I have never got any reply to my emails to the user
> group. I
> > am hoping to break that trend this time. :)

Manas Kar