spark-dev mailing list archives

From Sean Owen <>
Subject Re: RDD.count
Date Sat, 28 Mar 2015 13:10:35 GMT
No, I'm not saying side effects change the count. But not executing
the map() function at all certainly has an effect on that function's
side effects: the side effects which should take place never do. I'm
not sure that is something to be 'fixed'; it's a legitimate behavior.
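To make the point concrete, here is a minimal plain-Scala analogy (not Spark code; `optimizedCount` is an invented name): if a count shortcut reads the size of the parent collection instead of running the mapped function, that function's side effects simply never happen.

```scala
object SkippedSideEffects extends App {
  var sideEffects = 0
  val source = Seq(1, 2, 3)

  // Hypothetical "optimized" count: map cannot change cardinality,
  // so count the parent and never apply f at all.
  def optimizedCount[A, B](xs: Seq[A], f: A => B): Long = xs.size.toLong

  val n = optimizedCount(source, (x: Int) => { sideEffects += 1; x * 2 })
  assert(n == 3L)        // the count is still correct...
  assert(sideEffects == 0) // ...but f was never executed
}
```

The count is right either way; what differs is whether the mapped function ever runs.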

You can persist an RDD if you do not want to compute it twice.
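A plain-Scala sketch of the idea behind persist()/cache() (assumed Spark behavior: a persisted RDD is materialized once and re-read by later actions, rather than recomputed):

```scala
object PersistAnalogy extends App {
  var evaluations = 0
  def expensive(x: Int): Int = { evaluations += 1; x * 2 }

  val input = Seq(1, 2, 3)

  // Without "persisting": a def recomputes the mapping on every use,
  // like running two actions on an unpersisted RDD.
  def mapped = input.map(expensive)
  mapped.size
  mapped.size
  assert(evaluations == 6) // the map ran twice over 3 elements

  // With "persisting": materialize once, then reuse the cached result.
  evaluations = 0
  val persisted = input.map(expensive)
  persisted.size
  persisted.size
  assert(evaluations == 3) // the map ran only once
}
```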

On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll <> wrote:
> Hi Sean,
> Thanks for the response.
> I can't imagine a case (though my imagination may be somewhat limited) where
> even map-side effects could change the number of elements in the resulting
> RDD.
> I guess "count" wouldn't officially be an 'action' if it were implemented
> this way. At least it wouldn't ALWAYS be one.
> My example was contrived. We're passing RDDs to functions. If that RDD is an
> instance of my class, then its count() may take a shortcut. If I
> map/zip/zipWithIndex/mapPartitions/etc. first, then I'm stuck with a call that
> literally takes 100s to 1000s of times longer (seconds vs. hours on some of
> our datasets). And since my custom RDDs are immutable, they cache the count,
> so a second invocation costs only a method call's overhead.
> I could fix this in Spark if there's any interest in that change. Otherwise
> I'll need to override more RDD methods for my own purposes (like all of the
> transformations). Of course, that will be more difficult because those
> intermediate classes (like MappedRDD) are private, so I can't extend them.
> Jim
> --
> Sent from the Apache Spark Developers List mailing list archive.
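A hypothetical sketch of the cached-count idea Jim describes (the names `CountingWrapper` and `timesComputed` are invented for illustration, not Spark API): because the wrapped data is immutable, the count can be memoized, so a second invocation is just the cost of reading a field.

```scala
// An immutable collection wrapper that memoizes its count.
final class CountingWrapper[A](underlying: Seq[A]) {
  private var computations = 0
  // lazy val: the (potentially expensive) traversal runs at most once.
  lazy val count: Long = { computations += 1; underlying.size.toLong }
  def timesComputed: Int = computations
}

object CountingWrapperDemo extends App {
  val w = new CountingWrapper(Seq("a", "b", "c"))
  assert(w.count == 3L)
  assert(w.count == 3L)         // second call hits the cached value
  assert(w.timesComputed == 1)  // the traversal happened only once
}
```

This is safe only because the wrapped data never changes; a mutable collection would make the memoized count stale.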

