spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <>
Subject Re: RDD Moving Average
Date Tue, 06 Jan 2015 20:43:38 GMT
So you want windows covering the same length of time, some of which will be
fuller than others? You could, for example, simply bucket the data by
minute to get this kind of effect. If you an RDD[Ticker], where Ticker has
a timestamp in ms, you could:

tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000))

... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment
at the start of each minute, and the values are the Tickers within the
following minute. You can try variations on this to bucket in different

Just be careful because a minute with a huge number of values might cause
you to run out of memory. If you're just doing aggregations of some kind
there are more efficient methods than this most generic method, like the
aggregate methods.

On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis <> wrote:

> ​Thanks. Another question. ​I have event data with timestamps. I want to
> create a sliding window using timestamps. Some windows will have a lot of
> events in them others won’t. Is there a way to get an RDD made of this kind
> of a variable length window?
> On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen <> wrote:
>> First you'd need to sort the RDD to give it a meaningful order, but I
>> assume you have some kind of timestamp in your data you can sort on.
>> I think you might be after the sliding() function, a developer API in
>> MLlib:
>> On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis <> wrote:
>>> Is there an easy way to do a moving average across a single RDD (in a
>>> non-streaming app). Here is the use case. I have an RDD made up of stock
>>> prices. I want to calculate a moving average using a window size of N.
>>> Thanks.
>>> Asim

View raw message