spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Help needed. Not sure how reduceByKey works in Spark
Date Fri, 10 Jan 2014 19:09:42 GMT
So for each (col2, col3) pair, you want the difference between the earliest
col1 value and the latest col1 value?

I'd suggest something like this, first parsing col1 into epoch milliseconds so
the subtraction is plain arithmetic:

import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("M/d/yyyy h:mm:ss a") // matches "1/11/2014 12:18:40 AM"
val data = sc.textFile(...).map(l => l.split("\t"))
data.map(r => ((r(1), r(2)), fmt.parse(r(0)).getTime)) // RDD of ((col2, col3), col1 as epoch millis)
    .groupByKey()                                      // now have ((col2, col3), [col1s])
    .map(p => (p._1, (p._2.max - p._2.min) / 1000))    // now have ((col2, col3), diff in seconds)
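On your sample rows that comes out to ((123, 143), 9), since 12:18:49 AM minus
12:18:40 AM is 9 seconds.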

The downside of this approach is that if you have a (col2, col3) pair with
tons of col1 values, you might OOM one of your executors in the groupByKey.
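One way around that is to skip groupByKey and instead fold each timestamp into
a running (min, max) pair with reduceByKey, so no executor ever has to hold the
full list of col1 values for a key. A sketch along the same lines, reusing the
fmt parser above:

data.map { r =>
  val t = fmt.parse(r(0)).getTime               // col1 as epoch millis
  ((r(1), r(2)), (t, t))                        // seed each record as its own (min, max)
}.reduceByKey { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))  // merge is associative and commutative
}.map { case (k, (lo, hi)) => (k, (hi - lo) / 1000) } // ((col2, col3), diff in seconds)

Because the merge runs map-side before the shuffle, each key's values collapse
to a single pair per partition instead of being collected into one big list.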

Andrew


On Fri, Jan 10, 2014 at 11:01 AM, suman bharadwaj <suman.dna@gmail.com> wrote:

> Hi,
>
> I'm new to Spark, and I need some help understanding how reduceByKey
> works.
>
> I have the following data:
>
> col1                     col2    col3
> 1/11/2014 12:18:40 AM    123     143
> 1/11/2014 12:18:45 AM    123     143
> 1/11/2014 12:18:49 AM    123     143
>
> the output i need is
>
> col2    col3    totaltime (current value of col1 - prev value of col1)
> 123     143     9
>
> I'm doing the following:
>
> map((col2,col3),col1).reduceByKey( <here I don't know how to perform the
> subtraction of dates> )
>
> How do I perform the subtraction of dates?
> How does reduceByKey work when my map emits
> ((col2,col3),(col1,col4))?
>
>
> Thanks in advance.
>
