spark-dev mailing list archives

From jimfcarroll <jimfcarr...@gmail.com>
Subject RDD.count
Date Sat, 28 Mar 2015 00:54:58 GMT
Hi all,

Why does the RDD.count call recompute the RDD in all cases? In most cases it
could simply ask the RDD it depends on. I have several RDD implementations
and was surprised to see that a call like the following never calls my RDD's
count method but instead recomputes/traverses the entire dataset:

   val myRDD: MyRDD = ...
   myRDD.map({ ... }).count()

Unless I'm mistaken, a MappedRDD never needs to do more than call 'count' on
the underlying RDD. In all of my cases, the underlying RDD's count method
knows the count without a recompute (e.g. one of them selects the count
from a DB). This is MUCH less expensive than recomputing the RDD.
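
To make the idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names SimpleDataset, MappedDataset and KnownCountDataset are hypothetical stand-ins invented for illustration; they are not Spark classes and this is not how Spark's RDD is implemented. It only shows a 1:1 map delegating count to its parent instead of traversing the data:

```scala
// Toy model only: hypothetical types, not part of Spark.
trait SimpleDataset[T] {
  // Full traversal of the data (the expensive path).
  def iterator: Iterator[T]

  // Default count: walk every element, as a generic implementation must.
  def count(): Long = iterator.foldLeft(0L)((n, _) => n + 1)

  // map is one-to-one, so the result can reuse the parent's count.
  def map[U](f: T => U): SimpleDataset[U] = new MappedDataset(this, f)
}

final class MappedDataset[T, U](parent: SimpleDataset[T], f: T => U)
    extends SimpleDataset[U] {
  def iterator: Iterator[U] = parent.iterator.map(f)

  // Delegate instead of traversing: map never adds or drops elements.
  override def count(): Long = parent.count()
}

// Stand-in for a source that knows its size cheaply
// (e.g. one that would select the count from a DB).
final class KnownCountDataset[T](data: Seq[T]) extends SimpleDataset[T] {
  // Counts how many times the data was actually traversed.
  var traversals = 0

  def iterator: Iterator[T] = { traversals += 1; data.iterator }

  override def count(): Long = data.size.toLong
}

object CountDemo {
  def main(args: Array[String]): Unit = {
    val source = new KnownCountDataset(Seq("a", "b", "c"))
    val n = source.map(_.toUpperCase).count()
    println(s"count = $n, traversals = ${source.traversals}")
    // prints "count = 3, traversals = 0"
  }
}
```

Note that in this sketch the shortcut only holds for one-to-one transformations such as map (a filter or flatMap could not delegate this way), and the function passed to map is never invoked when only the count is requested.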

Thanks.
Jim




