spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Asim Jalis <asimja...@gmail.com>
Subject Re: RDD Moving Average
Date Tue, 06 Jan 2015 22:09:46 GMT
One problem with this is that we are creating a lot of iterables containing
a lot of repeated data. Is there a way to do this so that we can calculate
a moving average incrementally?

On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <sowen@cloudera.com> wrote:

> Yes, if you break it down to...
>
> tickerRDD.map(ticker =>
>   (ticker.timestamp, ticker)
> ).map { case(ts, ticker) =>
>   ((ts / 60000) * 60000, ticker)
> }.groupByKey
>
> ... as Michael alluded to, then it more naturally extends to the sliding
> window, since you can flatMap one Ticker to many (bucket, ticker) pairs,
> then group. I think this would implementing 1 minute buckets, sliding by 10
> seconds:
>
> tickerRDD.flatMap(ticker =>
>   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts,
> ticker))
> ).map { case(ts, ticker) =>
>   ((ts / 60000) * 60000, ticker)
> }.groupByKey
>
> On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>
>> I guess I can use a similar groupBy approach. Map each event to all the
>> windows that it can belong to. Then do a groupBy, etc. I was wondering if
>> there was a more elegant approach.
>>
>> On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>>
>>> Except I want it to be a sliding window. So the same record could be in
>>> multiple buckets.
>>>
>>>

Mime
View raw message