spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Creating time-sequential pairs
Date Sat, 10 May 2014 23:46:35 GMT
How about ...

val data = sc.parallelize(Array((1,0.05),(2,0.10),(3,0.15)))
val pairs = data.join(data.map(t => (t._1 + 1, t._2)))

It's a self-join, but one copy has its ID incremented by 1. I don't
know how performant it is, but it works, although the output looks more like:

(2,(0.1,0.05))
(3,(0.15,0.1))
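That join output can be mapped back into the exact ((t1, v1), (t2, v2)) format Nick asked for. A minimal sketch (untested, assuming a live SparkContext `sc` and integer timestamps spaced exactly 1 apart):

```scala
// Shift each timestamp forward by 1, join against the original RDD,
// then rebuild ((t1, v1), (t2, v2)) pairs from the join output.
val data = sc.parallelize(Array((1, 0.05), (2, 0.10), (3, 0.15)))
val shifted = data.map { case (t, v) => (t + 1, v) }
val pairs = data.join(shifted).map { case (t2, (v2, v1)) =>
  ((t2 - 1, v1), (t2, v2))
}
// yields ((1,0.05),(2,0.10)) and ((2,0.10),(3,0.15))
```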
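Note that Nick's timestamps are Doubles, so the +1 trick only works when samples are exactly unit-spaced. For arbitrary spacing, one alternative sketch (untested; assumes a Spark version with RDD.zipWithIndex) is to sort by time, attach a sequential index, and join adjacent indices:

```scala
// Sort by timestamp, index each element, then self-join element i
// with element i + 1 to form consecutive pairs regardless of spacing.
val indexed = data.sortByKey().zipWithIndex().map { case (tv, i) => (i, tv) }
val pairs = indexed
  .join(indexed.map { case (i, tv) => (i - 1, tv) })
  .map { case (_, (a, b)) => (a, b) }
```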

On Thu, May 8, 2014 at 11:04 PM, Nicholas Pritchard
<nicholas.pritchard@falkonry.com> wrote:
> Hi Spark community,
>
> I have a design/algorithm question that I assume is common enough for
> someone else to have tackled before. I have an RDD of time-series data
> formatted as time-value tuples, RDD[(Double, Double)], and am trying to
> extract threshold crossings. In order to do so, I first want to transform
> the RDD into pairs of time-sequential values.
>
> For example:
> Input: The time-series data:
> (1, 0.05)
> (2, 0.10)
> (3, 0.15)
> Output: Transformed into time-sequential pairs:
> ((1, 0.05), (2, 0.10))
> ((2, 0.10), (3, 0.15))
>
> My initial thought was to use a custom partitioner that ensures
> sequential data is kept together within a partition. Then I could use
> "mapPartitions" to transform these runs of sequential data. Finally, I
> would need some logic for creating sequential pairs across the boundaries of
> each partition.
>
> However, I was hoping to get some feedback and ideas from the community.
> Anyone have thoughts on a simpler solution?
>
> Thanks,
> Nick
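For comparison, the mapPartitions idea Nick describes could start from something like the sketch below (untested; it pairs consecutive elements within each partition only, so the cross-partition boundary pairs he mentions would still need separate handling):

```scala
// Pair consecutive elements within each partition using Scala's
// sliding iterator; pairs spanning a partition boundary are missed here.
val withinPartition = data.sortByKey().mapPartitions { iter =>
  iter.sliding(2).withPartial(false).map { case Seq(a, b) => (a, b) }
}
```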
