spark-user mailing list archives

From Paolo Platter <paolo.plat...@agilelab.it>
Subject Re: RDD Moving Average
Date Wed, 07 Jan 2015 00:10:13 GMT
In my opinion you should use the fold pattern, obviously after a sortBy transformation.
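
A minimal sketch of what that could look like, assuming a hypothetical Ticker(timestamp: Long, price: Double) case class (the thread never shows the class) and a moving average over the last n ticks. Note that an RDD-level fold does not preserve order, so this version sorts first and folds over a local iterator on the driver, which only suits modest data sizes:

case class Ticker(timestamp: Long, price: Double)  // hypothetical schema, not from the thread

val n = 5  // window length in ticks
val sorted = tickerRDD.sortBy(_.timestamp)

// Fold over the time-ordered stream, carrying the last n prices as state
// and emitting one average per tick.
val (_, averages) =
  sorted.toLocalIterator.foldLeft((Vector.empty[Double], Vector.empty[Double])) {
    case ((window, out), t) =>
      val w = (window :+ t.price).takeRight(n)
      (w, out :+ w.sum / w.size)
  }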

Paolo

Sent from my Windows Phone
________________________________
From: Asim Jalis <asimjalis@gmail.com>
Sent: 06/01/2015 23:11
To: Sean Owen <sowen@cloudera.com>
Cc: user@spark.apache.org
Subject: Re: RDD Moving Average

One problem with this is that we are creating a lot of iterables containing a lot of repeated
data. Is there a way to do this so that we can calculate a moving average incrementally?
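
One hedged sketch of a more incremental alternative, reusing the hypothetical Ticker(timestamp, price) shape from above: aggregateByKey keeps only a running (sum, count) per window key, so no per-window iterable of repeated tickers is ever materialized:

// Assign each price to every 1-minute window (sliding by 15 s) it falls in,
// mirroring the bucketing in Sean's code below.
val windowed = tickerRDD.flatMap { t =>
  (t.timestamp - 60000L to t.timestamp by 15000L)
    .map(ts => ((ts / 60000) * 60000, t.price))
}

// Keep a running (sum, count) per window instead of grouping whole tickers.
val movingAvg = windowed
  .aggregateByKey((0.0, 0L))(
    { case ((sum, cnt), p) => (sum + p, cnt + 1) },      // fold one price into the accumulator
    { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }  // merge partition accumulators
  )
  .mapValues { case (sum, cnt) => sum / cnt }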

On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <sowen@cloudera.com> wrote:
Yes, if you break it down to...

tickerRDD.map(ticker =>
  (ticker.timestamp, ticker)       // key each ticker by its timestamp
).map { case (ts, ticker) =>
  ((ts / 60000) * 60000, ticker)   // truncate the timestamp down to its 1-minute bucket
}.groupByKey

... as Michael alluded to, then it more naturally extends to the sliding window, since you
can flatMap one Ticker to many (bucket, ticker) pairs, then group. I think this would implement
1-minute buckets, sliding by 15 seconds:

tickerRDD.flatMap(ticker =>
  // emit one (timestamp, ticker) pair per 15-second step across the preceding minute
  (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
).map { case (ts, ticker) =>
  ((ts / 60000) * 60000, ticker)   // truncate each shifted timestamp to its 1-minute bucket
}.groupByKey
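
For completeness, a hedged sketch of reducing those grouped windows to averages, assuming the expression above is bound to a val windows and that Ticker carries a price field (an assumption; the thread never shows the class):

// Average the prices collected in each window.
val avgPerWindow = windows.mapValues { tickers =>
  val prices = tickers.map(_.price)
  prices.sum / prices.size
}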

On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimjalis@gmail.com> wrote:
I guess I can use a similar groupBy approach. Map each event to all the windows that it can
belong to. Then do a groupBy, etc. I was wondering if there was a more elegant approach.

On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimjalis@gmail.com> wrote:
Except I want it to be a sliding window. So the same record could be in multiple buckets.


