spark-dev mailing list archives

From jimfcarroll
Subject Re: RDD.count
Date Sat, 28 Mar 2015 13:05:32 GMT
Hi Sean,

Thanks for the response.

I can't imagine a case (though my imagination may be somewhat limited) where
even map side effects could change the number of elements in the resulting
RDD.

I guess "count" wouldn't officially be an 'action' if it were implemented
this way. At least it wouldn't ALWAYS be one.

My example was contrived. We're passing RDDs to functions. If that RDD is an
instance of my class, then its count() may take a shortcut, and since my
custom RDDs are immutable they cache the count, so a second invocation costs
only a method call's overhead. If I
map/zip/zipWithIndex/mapPartitions/etc. first, then I'm stuck with a call that
literally takes 100s to 1000s of times longer (seconds vs. hours on some of
our datasets).
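
To make the idea concrete, here is a toy model in plain Python. It is not
Spark code: Dataset, KnownSizeDataset and MappedDataset are hypothetical
stand-ins for RDD, my custom RDD and MappedRDD, and the scans counter exists
only to show when a full pass over the data happens. It assumes the mapped
function is strictly one-to-one.

```python
from typing import Any, Callable, Iterable, Iterator


class Dataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable[Any]]):
        self._source = source
        self.scans = 0  # number of full passes started over this dataset

    def __iter__(self) -> Iterator[Any]:
        self.scans += 1
        return iter(self._source())

    def count(self) -> int:
        # Default behaviour: walk every element.
        return sum(1 for _ in self)

    def map(self, f: Callable[[Any], Any]) -> "Dataset":
        return MappedDataset(self, f)


class KnownSizeDataset(Dataset):
    """Immutable dataset that already knows its size (the 'shortcut')."""

    def __init__(self, source: Callable[[], Iterable[Any]], size: int):
        super().__init__(source)
        self._size = size

    def count(self) -> int:
        # No scan needed: the size was fixed at construction time.
        return self._size


class MappedDataset(Dataset):
    """Result of a one-to-one map over a parent dataset."""

    def __init__(self, parent: Dataset, f: Callable[[Any], Any]):
        super().__init__(lambda: (f(x) for x in parent))
        self._parent = parent

    def count(self) -> int:
        # A one-to-one map cannot change the number of elements, so ask
        # the parent. Note that f is never called here, so any side
        # effects in f do not run.
        return self._parent.count()


base = KnownSizeDataset(lambda: range(5), 5)
mapped = base.map(lambda x: x * 2)
print(mapped.count(), base.scans)  # count answered without scanning base
```

The comment in MappedDataset.count is the trade-off: with this delegation,
count() no longer forces evaluation of the mapped function.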

I could fix this in Spark if there's any interest in that change. Otherwise
I'll need to overload more RDD methods for my own purposes (like all of the
transformations). Of course, that will be more difficult because those
intermediate classes (like MappedRDD) are private, so I can't extend them.

