spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gen <gen.tan...@gmail.com>
Subject Re: RDD.aggregate?
Date Fri, 12 Dec 2014 16:30:31 GMT
Hi,

I am not an expert about this, but I will try to explain it :)(in python)

The aggregate actions return a accumulator and it involve three arguments:
zeroValue, seqOp, combOp
zeroValue is the initial local accumulator 
seqOp is the function which combines the elements from rdd with the
accumulator. It takes two arguments, the first one is the accumulator, the
second one is element of rdd
combOp is the function which merge two accumulators. it take two arguments:
two are accumulators and works as reduce.

example:

/data = sc.parallelize(range(100), 2)
l = data.aggregate(zeroValue = ([], 0, 0),
                              sepOp = lambda acc, value: (acc[0] + [value],
acc[1] + value, acc[2] + 1)
                              combOp = acc1, acc2 : (acc1[0] + acc2[0],
acc1[1] + acc2[1], acc1[2] + acc2[2])
)/

In the above example, we want to return a tuple with three element : the
first one is a list containing all the elements in the rdd; the second one
is the sum of elements and the third one the number of elements. The
elements of rdd is integer from 0 to 99 (partition by 2).
So zeroValue is initial accumulator value of the type we want to return(I
think that there are two local accumulators, as the number of partitions is
2 and it uses mapPartitions within the aggregate); 
sepOp add each element within the partition to the local accumulator (work
as acc = seqOp(acc, value) );
combOp merge the local accumulators to a single accumulator (it works as
reduce)

The final result is ([0,1, ..., 99], 4950, 100)

Hope that it could help you.

Cheers
Gen




ll wrote
> can someone please explain how RDD.aggregate works?  i looked at the
> average example done with aggregate() but i'm still confused about this
> function... much appreciated.





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434p20663.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message