spark-user mailing list archives

From Asim Jalis <asimja...@gmail.com>
Subject Re: RDD Moving Average
Date Tue, 06 Jan 2015 20:47:16 GMT
I guess I can use a similar groupBy approach. Map each event to all the
windows that it can belong to. Then do a groupBy, etc. I was wondering if
there was a more elegant approach.
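The map-each-event-to-its-windows idea can be sketched with plain Scala collections; an RDD version would use flatMap followed by groupByKey. The window length W, slide step S, and the Event shape are illustrative assumptions, not values from the thread:

```scala
// Sketch: with windows of length W ms sliding every S ms, an event at
// time t belongs to every window whose start is a multiple of S in the
// half-open interval (t - W, t]. Emit one (windowStart, event) pair per
// window, then group by window start.
val W = 60000L // window length in ms (assumed)
val S = 15000L // slide interval in ms (assumed)

case class Event(timestamp: Long, price: Double)

def windowStarts(t: Long): Seq[Long] = {
  val first = ((t - W) / S + 1) * S // first candidate start > t - W
  val last  = (t / S) * S           // last start <= t
  (first to last by S).filter(_ >= 0)
}

val events = Seq(Event(5000L, 10.0), Event(20000L, 12.0), Event(70000L, 8.0))

// On an RDD this would be events.flatMap(...).groupByKey()
val windows: Map[Long, Seq[Event]] =
  events
    .flatMap(e => windowStarts(e.timestamp).map(start => (start, e)))
    .groupBy(_._1)
    .map { case (start, pairs) => (start, pairs.map(_._2)) }
```

The same event lands under several keys (the event at t = 70000 appears in four windows here), which is exactly the duplication a sliding window implies.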

On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimjalis@gmail.com> wrote:

> Except I want it to be a sliding window. So the same record could be in
> multiple buckets.
>
> On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen <sowen@cloudera.com> wrote:
>
>> So you want windows covering the same length of time, some of which will
>> be fuller than others? You could, for example, simply bucket the data by
>> minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has
>> a timestamp in ms, you could:
>>
>> tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)
>>
>> ... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment
>> at the start of each minute, and the values are the Tickers within the
>> following minute. You can try variations on this to bucket in different
>> ways.
>>
>> Just be careful because a minute with a huge number of values might cause
>> you to run out of memory. If you're just doing aggregations of some kind
>> there are more efficient methods than this most generic method, like the
>> aggregate methods.
>>
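The aggregate-style alternative mentioned above can be sketched locally: fold (sum, count) pairs per minute bucket rather than materializing each group, which on an RDD corresponds to aggregateByKey or reduceByKey. The Ticker shape and sample values are assumed for illustration:

```scala
// Per-minute average price via a running (sum, count) per bucket, so no
// bucket's full contents are ever held at once. This mirrors what
// aggregateByKey does on an RDD, shown here on a local Seq.
case class Ticker(timestamp: Long, price: Double)

val ticks = Seq(Ticker(1000L, 10.0), Ticker(2000L, 14.0), Ticker(61000L, 8.0))

val perMinuteAvg: Map[Long, Double] =
  ticks
    .foldLeft(Map.empty[Long, (Double, Long)]) { (acc, t) =>
      val key = (t.timestamp / 60000) * 60000 // same bucketing key as above
      val (sum, n) = acc.getOrElse(key, (0.0, 0L))
      acc.updated(key, (sum + t.price, n + 1))
    }
    .map { case (key, (sum, n)) => (key, sum / n) }
```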
>> On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>>
>>> Thanks. Another question. I have event data with timestamps. I want to
>>> create a sliding window using timestamps. Some windows will have a lot of
>>> events in them; others won’t. Is there a way to get an RDD made of this
>>> kind of variable-length window?
>>>
>>>
>>> On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen <sowen@cloudera.com> wrote:
>>>
>>>> First you'd need to sort the RDD to give it a meaningful order, but I
>>>> assume you have some kind of timestamp in your data you can sort on.
>>>>
>>>> I think you might be after the sliding() function, a developer API in
>>>> MLlib:
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43
>>>>
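For the fixed-size case, what sliding() provides can be illustrated with Scala's own Seq.sliding on an already-sorted price series; the window size and data here are made up:

```scala
// Moving average over a sorted price series: every run of n consecutive
// prices becomes one window, and each window is averaged. mllib's
// sliding() builds the same fixed-size windows across RDD partitions.
val prices = Seq(10.0, 12.0, 11.0, 13.0, 14.0)
val n = 3 // window size (assumed)

val movingAvg: Seq[Double] =
  prices.sliding(n).map(w => w.sum / n).toList
// first window (10.0, 12.0, 11.0) averages to 11.0
```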
>>>> On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>>>>
>>>>> Is there an easy way to do a moving average across a single RDD (in a
>>>>> non-streaming app)? Here is the use case. I have an RDD made up of stock
>>>>> prices. I want to calculate a moving average using a window size of N.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Asim
>>>>>
>>>>
>>>>
>>>
>>
>
