spark-user mailing list archives

From "Segerlind, Nathan L" <nathan.l.segerl...@intel.com>
Subject RE: RDD.aggregate versus accumulables...
Date Mon, 17 Nov 2014 18:22:17 GMT
Thanks for the link to the bug.

Unfortunately, using accumulators this way is being spread around as a recommended practice
despite the bug.


From: Daniel Siegmann [mailto:daniel.siegmann@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re: RDD.aggregate versus accumulables...

You should never use accumulators for this purpose because you may get incorrect answers.
Accumulators can count the same thing multiple times - you cannot rely upon the correctness
of the values they compute. See SPARK-732 (https://issues.apache.org/jira/browse/SPARK-732)
for more info.
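
[Editorial sketch] To illustrate the double-counting concern in general terms, here is a toy model in plain Python. It is not Spark code and does not reproduce Spark's scheduler. The `run_task` helper, the list used as a shared counter, and the simulated re-run of one partition are all assumptions made for illustration. The point it tries to show: a side-effect counter is bumped every time a task runs, while a result returned from the task is simply replaced when the task runs again.

```python
from functools import reduce

def run_task(partition, accumulator):
    """Process one partition (hypothetical helper, not a Spark API)."""
    local_sum = 0
    for x in partition:
        accumulator[0] += x   # side effect, analogous to an accumulator update
        local_sum += x
    return local_sum          # returned value, analogous to an aggregate partial

partitions = [[1, 2, 3], [4, 5], [6]]
accumulator = [0]             # one-element list standing in for a shared counter
results = {}

# First pass: every partition is processed once.
for idx, part in enumerate(partitions):
    results[idx] = run_task(part, accumulator)

# Assumed scenario: the result for partition 1 is lost and the task runs again.
# The returned partial overwrites the old one; the side effect is applied twice.
results[1] = run_task(partitions[1], accumulator)

aggregated_total = reduce(lambda a, b: a + b, results.values(), 0)
side_effect_total = accumulator[0]

print(aggregated_total)   # sum built from returned partials
print(side_effect_total)  # sum built from side effects, includes the re-run
```

In this model the total built from returned values stays at the true sum of the data, while the side-effect total also contains the re-run partition's contribution.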

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L <nathan.l.segerlind@intel.com> wrote:
Hi All.

I am trying to get my head around why accumulators and accumulables seem to be the most
recommended method for accumulating running sums, averages, variances and the like, whereas
the aggregate method seems to me to be the right one. I have no performance measurements as
of yet, but it seems that aggregate is simpler and more intuitive (and it does what one might
expect an accumulator to do), whereas accumulators and accumulables seem to carry some extra
complications and overhead.
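
[Editorial sketch] For concreteness, the aggregate-style computation of sums, averages and variances mentioned above might look roughly like the following. This is a plain-Python model of the zero value / per-partition operator / combine operator pattern, not Spark itself. The `aggregate` helper, the nested lists standing in for partitions, and the names `seq_op` and `comb_op` are assumptions for illustration. In PySpark the corresponding call would be `rdd.aggregate(zeroValue, seqOp, combOp)`.

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partials with comb_op.

    Hypothetical stand-in for an RDD aggregate; partitions is a list of lists.
    """
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def seq_op(acc, x):
    # acc is (count, sum, sum of squares); fold one element into it.
    n, s, sq = acc
    return (n + 1, s + x, sq + x * x)

def comb_op(a, b):
    # Merge two partial (count, sum, sum of squares) tuples.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
count, total, total_sq = aggregate(partitions, (0, 0.0, 0.0), seq_op, comb_op)

mean = total / count
variance = total_sq / count - mean * mean  # population variance

print(count, mean, variance)
```

The sum-of-squares formula is used here only because it keeps the sketch short; it can lose precision on large or badly scaled data, so a numerically stable merge may be preferable in practice.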

So…

What’s the real difference between an accumulator/accumulable and aggregating an RDD? When
is one method of aggregation preferred over the other?

Thanks,
Nate



--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegmann@velos.io W: www.velos.io