But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative accumulation as opposed to an embarrassingly parallel operation such as this one?

This use case reminds me of FIR filtering in DSP. It seems that RDDs could use something that serves the same purpose as scala.collection.Iterator.sliding.

From: Koert Kuipers <koert@tresata.com>
To: Mohit Jaggi <mohitjaggi@gmail.com>
Cc: Tobias Pfeiffer <tgp@preferred.jp>; "Ganelin, Ilya" <Ilya.Ganelin@capitalone.com>; derrickburns <derrickrburns@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
Sent: Friday, January 30, 2015 7:11 AM
Subject: Re: spark challenge: zip with next???

assuming the data can be partitioned then you have many timeseries for which you want to detect potential gaps. also assuming the resulting gaps info per timeseries is much smaller data then the timeseries data itself, then this is a classical example to me of a sorted (streaming) foldLeft, requiring an efficient secondary sort in the spark shuffle. i am trying to get that into spark here:

On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mohitjaggi@gmail.com> wrote:

you can use the MLLib function or do the following (which is what I had done):

- in first pass over the data, using mapPartitionWithIndex, gather the first item in each partition. you can use collect (or aggregator) for this. “key” them by the partition index. at the end, you will have a map
   (partition index) --> first item
- in the second pass over the data, using mapPartitionWithIndex again, look at two (or in the general case N items at a time, you can use scala’s sliding iterator) items at a time and check the time difference(or any sliding window computation). To this mapParitition, pass the map created in previous step. You will need to use them to check the last item in this partition.

If you can tolerate a few inaccuracies then you can just do the second step. You will miss the “boundaries” of the partitions but it might be acceptable for your use case.

On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer <tgp@preferred.jp> wrote:


On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com> wrote:
Make a copy of your RDD with an extra entry in the beginning to offset. The you can zip the two RDDs and run a map to generate an RDD of differences.

Does that work? I recently tried something to compute differences between each entry and the next, so I did
  val rdd1 = ... // null element + rdd
  val rdd2 = ... // rdd + null element
but got an error message about zip requiring data sizes in each partition to match.