spark-user mailing list archives

From Asim Jalis <asimja...@gmail.com>
Subject Re: RDD Moving Average
Date Tue, 06 Jan 2015 20:45:40 GMT
Except I want it to be a sliding window. So the same record could be in
multiple buckets.
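
One way to get overlapping buckets, sketched under assumptions rather than taken from the thread: emit each record once per window that contains it (flatMap), then aggregate per window key instead of grouping, which avoids holding a whole window in memory. The Ticker fields, the helper names windowStarts and windowAverages, and the choice of windows that start at multiples of the slide interval are all hypothetical. The sketch assumes non-negative timestamps and a Spark version where the pair-RDD implicits are in scope by default (older releases need import org.apache.spark.SparkContext._).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type: the thread only says Ticker has a timestamp in ms.
case class Ticker(symbol: String, timestamp: Long, price: Double)

object SlidingWindowBuckets {
  // Start times of every window [start, start + windowMs) that contains ts.
  // Windows are assumed to begin at multiples of slideMs; ts must be >= 0.
  def windowStarts(ts: Long, windowMs: Long, slideMs: Long): List[Long] = {
    val latest = (ts / slideMs) * slideMs // latest window start <= ts
    Iterator.iterate(latest)(_ - slideMs)
      .takeWhile(start => start > ts - windowMs && start >= 0)
      .toList
  }

  // Average price per sliding window, keyed by window start time.
  def windowAverages(tickers: RDD[Ticker], windowMs: Long, slideMs: Long): RDD[(Long, Double)] =
    tickers
      // The same record is emitted once for each window it falls into.
      .flatMap(t => windowStarts(t.timestamp, windowMs, slideMs).map(start => (start, t.price)))
      // Keep only a running (sum, count) per window rather than all records.
      .aggregateByKey((0.0, 0L))(
        (acc, price) => (acc._1 + price, acc._2 + 1),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sliding-buckets").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val tickers = sc.parallelize(Seq(
      Ticker("ABC", 30000L, 10.0),
      Ticker("ABC", 90000L, 20.0),
      Ticker("ABC", 150000L, 30.0)))
    // Three-minute windows that slide forward one minute at a time.
    windowAverages(tickers, windowMs = 180000L, slideMs = 60000L)
      .sortByKey()
      .collect()
      .foreach(println)
    sc.stop()
  }
}
```

If the full contents of each window are needed rather than an aggregate, groupByKey() could replace the aggregateByKey step, with the memory caveat mentioned in the quoted reply below.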

On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen <sowen@cloudera.com> wrote:

> So you want windows covering the same length of time, some of which will
> be fuller than others? You could, for example, simply bucket the data by
> minute to get this kind of effect. If you have an RDD[Ticker], where Ticker
> has a timestamp in ms, you could:
>
> tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)
>
> ... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment
> at the start of each minute, and the values are the Tickers within the
> following minute. You can try variations on this to bucket in different
> ways.
>
> Just be careful because a minute with a huge number of values might cause
> you to run out of memory. If you're just doing aggregations of some kind
> there are more efficient methods than this most generic method, like the
> aggregate methods.
>
> On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>
>> Thanks. Another question. I have event data with timestamps. I want to
>> create a sliding window using timestamps. Some windows will have a lot of
>> events in them; others won't. Is there a way to get an RDD made of this
>> kind of variable-length window?
>>
>>
>> On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> First you'd need to sort the RDD to give it a meaningful order, but I
>>> assume you have some kind of timestamp in your data you can sort on.
>>>
>>> I think you might be after the sliding() function, a developer API in
>>> MLlib:
>>>
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43
>>>
>>> On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>>>
>>>> Is there an easy way to do a moving average across a single RDD (in a
>>>> non-streaming app)? Here is the use case. I have an RDD made up of stock
>>>> prices. I want to calculate a moving average using a window size of N.
>>>>
>>>> Thanks.
>>>>
>>>> Asim
>>>>
>>>
>>>
>>
>
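
For the original question in this thread (a moving average over a fixed window of N records), the suggestion to sort and then use MLlib's sliding() might look roughly like the sketch below. It assumes prices arrive as (timestamp, price) pairs, which is not stated in the thread, and that the spark-mllib artifact is on the classpath; the names MovingAverage, movingAverage and movingAverageLocal are hypothetical. A plain-collections version is included only as a reference for what is being computed.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._ // adds sliding() to RDDs (developer API)

object MovingAverage {
  // Reference version on a local collection: the mean of each run of n
  // consecutive values. Partial windows (fewer than n values) are dropped.
  def movingAverageLocal(xs: Seq[Double], n: Int): List[Double] =
    xs.sliding(n).filter(_.size == n).map(w => w.sum / n).toList

  // RDD version. Assumes records are (timestampMs, price) pairs.
  def movingAverage(prices: RDD[(Long, Double)], n: Int): RDD[Double] =
    prices
      .sortByKey()   // give the RDD a meaningful order first
      .values        // keep only the prices, now in time order
      .sliding(n)    // each element is a window of n consecutive prices
      .map(w => w.sum / w.size)
}
```

Because sliding() is a developer API, its signature may differ between Spark releases; the map step above only relies on the window being a sequence of doubles.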
