spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Asim Jalis <asimja...@gmail.com>
Subject Re: RDD Moving Average
Date Wed, 07 Jan 2015 00:47:58 GMT
One approach I was considering was to use mapPartitions. It is
straightforward to compute the moving average over a partition, except for
near the end point. Does anyone see how to fix that?

On Tue, Jan 6, 2015 at 7:20 PM, Sean Owen <sowen@cloudera.com> wrote:

> Interesting, I am not sure the order in which fold() encounters elements
> is guaranteed, although from reading the code, I imagine in practice it is
> first-to-last by partition and then folded first-to-last from those results
> on the driver. I don't know this would lead to a solution though as the
> result here needs to be an RDD, not one value.
>
> On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter <paolo.platter@agilelab.it>
> wrote:
>
>>  In my opinion you should use fold pattern. Obviously after an sort by
>> trasformation.
>>
>> Paolo
>>
>> Inviata dal mio Windows Phone
>>  ------------------------------
>> Da: Asim Jalis <asimjalis@gmail.com>
>> Inviato: ‎06/‎01/‎2015 23:11
>> A: Sean Owen <sowen@cloudera.com>
>> Cc: user@spark.apache.org
>> Oggetto: Re: RDD Moving Average
>>
>>   One problem with this is that we are creating a lot of iterables
>> containing a lot of repeated data. Is there a way to do this so that we can
>> calculate a moving average incrementally?
>>
>> On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> Yes, if you break it down to...
>>>
>>>  tickerRDD.map(ticker =>
>>>   (ticker.timestamp, ticker)
>>> ).map { case(ts, ticker) =>
>>>   ((ts / 60000) * 60000, ticker)
>>> }.groupByKey
>>>
>>>  ... as Michael alluded to, then it more naturally extends to the
>>> sliding window, since you can flatMap one Ticker to many (bucket, ticker)
>>> pairs, then group. I think this would implementing 1 minute buckets,
>>> sliding by 10 seconds:
>>>
>>>  tickerRDD.flatMap(ticker =>
>>>   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts,
>>> ticker))
>>> ).map { case(ts, ticker) =>
>>>   ((ts / 60000) * 60000, ticker)
>>> }.groupByKey
>>>
>>> On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>>>
>>>>  I guess I can use a similar groupBy approach. Map each event to all
>>>> the windows that it can belong to. Then do a groupBy, etc. I was wondering
>>>> if there was a more elegant approach.
>>>>
>>>> On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimjalis@gmail.com> wrote:
>>>>
>>>>>  Except I want it to be a sliding window. So the same record could be
>>>>> in multiple buckets.
>>>>>
>>>>>
>>
>

Mime
View raw message