spark-user mailing list archives

From Derrick Burns <derrickrbu...@gmail.com>
Subject Re: spark challenge: zip with next???
Date Fri, 30 Jan 2015 20:47:13 GMT
Koert, thanks for the referral to your current pull request!  I found it
very thoughtful and thought-provoking.



On Fri, Jan 30, 2015 at 9:19 AM, Koert Kuipers <koert@tresata.com> wrote:

> And if it's a single giant timeseries that is already sorted, then Mohit's
> solution sounds good to me.
>
> On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak <michaelmalak@yahoo.com>
> wrote:
>
>> But isn't foldLeft() overkill for the originally stated use case of max
>> diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative
>> non-associative accumulation as opposed to an embarrassingly parallel
>> operation such as this one?
>>
>> This use case reminds me of FIR filtering in DSP. It seems that RDDs
>> could use something that serves the same purpose as
>> scala.collection.Iterator.sliding.
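
For reference, here is what the sliding approach looks like on a plain Scala
iterator for the originally stated use case, max difference of adjacent pairs
(a minimal sketch; the timestamps are made up):

```scala
// Max difference between adjacent timestamps via Iterator.sliding:
// sliding(2) yields each adjacent pair as a Seq of length 2.
val ts = Iterator(1L, 3L, 10L, 12L)
val maxGap = ts.sliding(2).map { case Seq(a, b) => b - a }.max
// maxGap == 7
```

An RDD-level analogue of this exists in MLlib (a `sliding` method on
`org.apache.spark.mllib.rdd.RDDFunctions`, if I recall correctly), which
handles partition boundaries for you.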
>>
>>   ------------------------------
>>  *From:* Koert Kuipers <koert@tresata.com>
>> *To:* Mohit Jaggi <mohitjaggi@gmail.com>
>> *Cc:* Tobias Pfeiffer <tgp@preferred.jp>; "Ganelin, Ilya" <
>> Ilya.Ganelin@capitalone.com>; derrickburns <derrickrburns@gmail.com>; "
>> user@spark.apache.org" <user@spark.apache.org>
>> *Sent:* Friday, January 30, 2015 7:11 AM
>> *Subject:* Re: spark challenge: zip with next???
>>
>> Assuming the data can be partitioned, you have many timeseries for
>> which you want to detect potential gaps. Also assuming the resulting gap
>> info per timeseries is much smaller than the timeseries data itself,
>> this is a classic example to me of a sorted (streaming) foldLeft,
>> requiring an efficient secondary sort in the Spark shuffle. I am trying to
>> get that into Spark here:
>> https://issues.apache.org/jira/browse/SPARK-3655
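
To make the shape of that fold concrete, here is a minimal pure-Scala sketch of
the per-series streaming foldLeft (the function name, timestamps, and threshold
are mine, not from the thread; in Spark this would run per key, over values
delivered in sorted order by the secondary sort):

```scala
// Streaming foldLeft over one sorted timeseries: carry only the previous
// timestamp, emit (prev, next) pairs whose difference exceeds a threshold.
// Constant state per series is what makes the sorted/streaming form attractive.
def gapsSorted(ts: Iterator[Long], threshold: Long): List[(Long, Long)] =
  ts.foldLeft((Option.empty[Long], List.empty[(Long, Long)])) {
    case ((prev, acc), t) =>
      val hits = prev.filter(p => t - p > threshold).map(p => (p, t)).toList
      (Some(t), hits ::: acc)
  }._2.reverse

gapsSorted(Iterator(1L, 2L, 10L, 11L, 30L), threshold = 5L)
// List((2,10), (11,30))
```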
>>
>>
>>
>> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mohitjaggi@gmail.com>
>> wrote:
>>
>>
>> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3CCALRVTpKN65rOLzbETC+Ddk4O+YJm+TfAF5DZ8EuCpL-2YHYPZA@mail.gmail.com%3E
>>
>> You can use the MLlib function or do the following (which is what I had
>> done):
>>
>> - In the first pass over the data, using mapPartitionsWithIndex, gather the
>> first item in each partition. You can use collect (or an aggregator) for
>> this, "keying" each item by its partition index. At the end, you will have
>> a map
>>    (partition index) --> first item
>> - In the second pass, using mapPartitionsWithIndex again, look at items two
>> at a time (or N at a time in the general case; Scala's sliding iterator
>> works here) and check the time difference (or do any sliding-window
>> computation). Pass the map created in the previous step to this
>> mapPartitions call; you will need it to check the last item in each
>> partition against the first item of the next.
>>
>> If you can tolerate a few inaccuracies, you can do just the second step.
>> You will miss the "boundaries" of the partitions, but that might be
>> acceptable for your use case.
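
The second pass can be sketched as a pure function over one partition's
iterator plus the first element of the next partition (the names and the gap
threshold are mine; in Spark, `part` would be the iterator handed to
mapPartitionsWithIndex and `nextHead` would be looked up in the map from the
first pass):

```scala
// Pair each element with its successor; appending the next partition's first
// element covers the partition boundary so no adjacent pair is missed.
def gaps(part: Iterator[Long], nextHead: Option[Long],
         threshold: Long): Iterator[(Long, Long)] =
  (part ++ nextHead.iterator)
    .sliding(2).withPartial(false) // drop a trailing singleton group
    .collect { case Seq(a, b) if b - a > threshold => (a, b) }

gaps(Iterator(1L, 2L, 10L), nextHead = Some(11L), threshold = 5L).toList
// List((2,10))
```

The last partition simply passes `nextHead = None`, and `withPartial(false)`
ensures its final element is not compared against anything.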
>>
>>
>>
>> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer <tgp@preferred.jp> wrote:
>>
>> Hi,
>>
>> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <
>> Ilya.Ganelin@capitalone.com> wrote:
>>
>> Make a copy of your RDD with an extra entry at the beginning as an offset.
>> Then you can zip the two RDDs and run a map to generate an RDD of
>> differences.
>>
>>
>> Does that work? I recently tried something to compute differences between
>> each entry and the next, so I did
>>   val rdd1 = ... // null element + rdd
>>   val rdd2 = ... // rdd + null element
>> but got an error message about zip requiring data sizes in each partition
>> to match.
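
One way around zip's equal-partition-size requirement is to key every element
by its position and join against the same data with positions shifted by one.
A minimal sketch of the pairing logic on plain collections (the data is made
up; the RDD version would, as I understand it, use rdd.zipWithIndex() with a
join rather than an in-memory Map):

```scala
// zip-with-next via index keys: element i is paired with element i + 1,
// so partition sizes never have to line up as they do for zip.
val xs = Seq(3L, 7L, 20L)
val byIdx = xs.zipWithIndex.map { case (v, i) => (i.toLong, v) }.toMap
val diffs = byIdx.toList.sortBy(_._1).flatMap {
  case (i, v) => byIdx.get(i + 1).map(next => next - v)
}
// diffs == List(4, 13)
```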
>>
>> Tobias
>>
>>
>>
>>
>>
>>
>
