spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: Analyzing consecutive elements
Date Thu, 22 Oct 2015 07:43:21 GMT
I'm not sure if there is a better way to do it directly using the Spark APIs, but I would try
mapPartitions and then, within each partition's Iterable, something like:

elems.zip(elems.drop(1)) - using the Scala collection APIs (here elems is the partition's
contents materialized as a Seq; an RDD itself has no drop, and a raw Iterator is single-pass)

This should give you what you need inside a partition. I'm hoping that you can partition your
data somehow (e.g. by user id or session id) in a way that makes your algorithm parallel. In
that case you can use the snippet above in a reduceByKey.
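
A minimal self-contained sketch of that idea (the session-keyed sample data and names are made
up for illustration; it uses groupByKey rather than reduceByKey, since the pairing needs each
key's whole sorted sequence at once):

    import org.apache.spark.{SparkConf, SparkContext}

    object ConsecutivePairs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("consecutive-pairs").setMaster("local[*]"))

        // Hypothetical (sessionId, timestamp) events; real data would come from your source.
        val events = sc.parallelize(Seq(
          ("s1", 1L), ("s1", 2L), ("s1", 3L),
          ("s2", 10L), ("s2", 11L)
        ))

        // Collect each session's events together, sort by time, then pair
        // consecutive elements with the Scala collection APIs.
        val pairs = events
          .groupByKey()
          .mapValues { ts =>
            val sorted = ts.toSeq.sorted
            sorted.zip(sorted.drop(1)) // [A, B, C] -> [(A, B), (B, C)]
          }

        // Prints each session with its consecutive pairs.
        pairs.collect().foreach(println)
        sc.stop()
      }
    }

Each key is processed independently, so the pairing parallelizes across sessions.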

hope this helps
-adrian

Sent from my iPhone

On 22 Oct 2015, at 09:36, Sampo Niskanen <sampo.niskanen@wellmo.com> wrote:

Hi,

I have analytics data with timestamps on each element.  I'd like to analyze consecutive elements
using Spark, but haven't figured out how to do this.

Essentially what I'd want is a transform from a sorted RDD [A, B, C, D, E] to an RDD [(A,B),
(B,C), (C,D), (D,E)].  (Or some other way to analyze time-related elements.)
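
For example, with plain Scala collections (assuming the data fit in memory) the pairing would be:

    val xs = List("A", "B", "C", "D", "E")
    xs.zip(xs.drop(1))  // List((A,B), (B,C), (C,D), (D,E))

but I need the equivalent on an RDD.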

How can this be achieved?


    Sampo Niskanen
    Lead developer / Wellmo

    sampo.niskanen@wellmo.com
    +358 40 820 5291

