spark-dev mailing list archives

From jimfcarroll
Subject Re: RDD.count
Date Sun, 29 Mar 2015 01:14:06 GMT
Hello all,

I worked around this for now using a class I already had: it inherits from
RDD and is the base class for all of our custom RDDs. I did the following:

1) Override all of the transformations (that get used in our app) that don't
change the RDD size, wrapping the results with a proxy RDD that intercepts
the count() call and either returns a cached value or, if it doesn't already
know the count, calls an abstract "calculateSize".

2) Piggyback a count calculation on all actions that we use (aggregate,
reduce, fold, foreach) so that, as a side effect of calling any of these, the
count is calculated and stored if it isn't already known.
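
To make the two steps above concrete, here is a minimal sketch in plain
Python. It is a toy model, not Spark code and not the code from our app: the
class CountedDataset, its size_scans counter, and the method calculate_size
are all hypothetical names chosen for illustration, and the "partitions" are
just in-memory lists that are transformed eagerly rather than lazily.

```python
class CountedDataset:
    """Toy stand-in for an RDD subclass that remembers its element count."""

    def __init__(self, partitions, known_count=None):
        self._partitions = [list(p) for p in partitions]
        self._count = known_count  # None means "not known yet"
        self.size_scans = 0        # passes made only to compute the count

    def map(self, fn):
        # Size-preserving transformation: hand the known count (if any)
        # to the result so it never has to be recomputed.
        mapped = [[fn(x) for x in p] for p in self._partitions]
        return CountedDataset(mapped, known_count=self._count)

    def filter(self, pred):
        # Size-changing transformation: the result's count is unknown.
        kept = [[x for x in p if pred(x)] for p in self._partitions]
        return CountedDataset(kept, known_count=None)

    def calculate_size(self):
        # Fallback used only when nothing else has produced the count.
        self.size_scans += 1
        return sum(len(p) for p in self._partitions)

    def count(self):
        # Intercepted count(): return the cached value when there is one.
        if self._count is None:
            self._count = self.calculate_size()
        return self._count

    def fold(self, zero, op):
        # Action with a piggybacked count: tally elements during the same
        # pass and store the tally if the count was not already known.
        acc, seen = zero, 0
        for part in self._partitions:
            for x in part:
                acc = op(acc, x)
                seen += 1
        if self._count is None:
            self._count = seen
        return acc


data = CountedDataset([[1, 2, 3], [4, 5]])
evens = data.filter(lambda x: x % 2 == 0)   # count unknown after filter
total = evens.fold(0, lambda a, b: a + b)   # also records the count
doubled = evens.map(lambda x: x * 2)        # inherits the known count
print(total, evens.count(), doubled.count(), doubled.size_scans)
```

In this sketch, neither count() call after the fold triggers a separate
counting pass, which is the effect described in 1) and 2).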

The one thing I couldn't do (at least yet) was get zipWithIndex to calculate
the count because its implementation is too opaque inside of the RDD.

If anyone wants to see the code I can post it.

Thanks for the responses.

