spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Cheapest way to materialize an RDD?
Date Fri, 30 Jan 2015 23:54:17 GMT
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving the count is probably a win, but not
big. Well, maybe good to know.

On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch <javadba@gmail.com> wrote:
> Theoretically your approach would require less overhead - i.e. a collect on
> the driver is not required as the last step.  But maybe the difference is
> small and that particular path may or may not have been properly optimized
> vs the count(). Do you have a biggish data set to compare the timings?
>
> 2015-01-30 14:42 GMT-08:00 Sean Owen <sowen@cloudera.com>:
>>
>> So far, the canonical way to materialize an RDD just to make sure it's
>> cached is to call count(). That's fine but incurs the overhead of
>> actually counting the elements.
>>
>> However, rdd.foreachPartition(p => None) for example also seems to
>> cause the RDD to be materialized, and is a no-op. Is that a better way
>> to do it or am I not thinking of why it's insufficient?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
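
For anyone who wants to repeat the comparison discussed in this thread, here is a rough sketch of such a timing test. It is not the test Sean ran. The object name MaterializeRdd, the helper timeMs, the data size, and the local[*] master are all made up for illustration. It assumes the RDD is marked with cache(), since the no-op never reads the partition iterator and so relies on the caching layer computing each partition in full when the iterator is requested; on an uncached RDD the no-op would most likely compute nothing.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MaterializeRdd {
  // Hypothetical helper: runs `body` and returns elapsed wall-clock milliseconds.
  def timeMs(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration only.
    val conf = new SparkConf().setAppName("materialize-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Build a new cached RDD for each run so neither action benefits
      // from blocks cached by the other.
      def freshRdd() =
        sc.parallelize(1 to 1000000, 8).map(i => (i, i.toString)).cache()

      // Option 1: count() forces every partition and also counts the elements.
      val a = freshRdd()
      val countMs = timeMs { a.count() }

      // Option 2: a no-op per partition; the function ignores the iterator.
      val b = freshRdd()
      val noopMs = timeMs { b.foreachPartition(_ => ()) }

      // Check how many partitions of `b` actually ended up in the cache.
      val cachedParts =
        sc.getRDDStorageInfo.find(_.id == b.id).map(_.numCachedPartitions).getOrElse(0)

      println(s"count(): $countMs ms")
      println(s"foreachPartition no-op: $noopMs ms")
      println(s"cached partitions after no-op: $cachedParts of ${b.getNumPartitions}")
    } finally {
      sc.stop()
    }
  }
}
```

A single run like this is as unscientific as the test mentioned above: JVM warm-up favors whichever action runs second, so swap the order or repeat the runs before reading much into the numbers.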
