spark-user mailing list archives

From Mohit Jaggi <>
Subject Re: spark challenge: zip with next???
Date Fri, 30 Jan 2015 05:27:48 GMT

You can use the MLlib sliding function (org.apache.spark.mllib.rdd.RDDFunctions.sliding) or do the following (which is what I had done):

- In the first pass over the data, using mapPartitionsWithIndex, gather the first item in each
partition. You can use collect (or an aggregator) for this. “Key” the items by the partition
index; at the end, you will have a map
   (partition index) --> first item
- In the second pass over the data, using mapPartitionsWithIndex again, look at two items at
a time (or, in the general case, N items at a time; you can use Scala’s sliding iterator)
and compute the time difference (or any other sliding-window computation). Pass the map
created in the previous step to this mapPartitions call; you will need it to check the last
item in each partition against the first item of the following partition.
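The two passes above can be sketched roughly as follows. This is a minimal sketch under assumptions not in the original post: a live Spark context, an RDD[Long] of timestamps, and the illustrative names events and timeDiffs.

```scala
import org.apache.spark.rdd.RDD

// Sketch of the two-pass approach; names are illustrative, not from the post.
def timeDiffs(events: RDD[Long]): RDD[Long] = {
  // Pass 1: first item of each partition, keyed by partition index.
  val firstItems: Map[Int, Long] =
    events.mapPartitionsWithIndex { (idx, iter) =>
      if (iter.hasNext) Iterator((idx, iter.next())) else Iterator.empty
    }.collect().toMap

  // Pass 2: slide a window of two over each partition, after appending
  // the first item of the *next* partition so boundary pairs are kept.
  events.mapPartitionsWithIndex { (idx, iter) =>
    (iter ++ firstItems.get(idx + 1).iterator)
      .sliding(2)
      .withPartial(false) // a one-element partition yields no window
      .map { case Seq(a, b) => b - a }
  }
}
```

One caveat the sketch ignores: if partition idx + 1 happens to be empty, the boundary pair to the next non-empty partition is still missed.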

If you can tolerate a few inaccuracies, you can do only the second step. You will miss the
pairs that straddle partition boundaries, but that might be acceptable for your use case.
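The per-partition part of that second step is just Scala’s sliding iterator; on a plain iterator (no Spark) it looks like this, with example timestamps made up for illustration:

```scala
// Plain-Scala illustration of the per-partition sliding step.
// Within one partition, pair each element with the next and diff them.
val timestamps = Iterator(10L, 12L, 15L, 21L)

val diffs = timestamps
  .sliding(2)
  .withPartial(false)              // drop a trailing window shorter than 2
  .map { case Seq(a, b) => b - a } // difference between consecutive items
  .toList

// Inside mapPartitions this runs once per partition, which is exactly why
// the pair spanning each partition boundary is the one that gets missed.
```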

> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer <> wrote:
> Hi,
> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <> wrote:
> Make a copy of your RDD with an extra entry in the beginning to offset. Then you can zip
> the two RDDs and run a map to generate an RDD of differences.
> Does that work? I recently tried something to compute differences between each entry
> and the next, so I did
>   val rdd1 = ... // null element + rdd
>   val rdd2 = ... // rdd + null element
> but got an error message about zip requiring data sizes in each partition to match.
> Tobias
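On plain Scala collections, the offset-and-zip idea from the quoted reply works directly; it is the RDD version that trips over zip’s alignment requirement. A small illustration, with made-up values:

```scala
// Offset-and-zip on a plain list: pair each element with its successor.
val xs = List(10L, 12L, 15L, 21L)
val deltas = xs.zip(xs.tail).map { case (a, b) => b - a }

// RDD.zip, by contrast, requires both RDDs to have the same number of
// partitions and the same number of elements in each partition, which is
// why the null-padded RDDs in the quoted message failed to zip.
```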
