spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: RDD.count
Date Sat, 28 Mar 2015 13:10:35 GMT
No, I'm not saying side effects change the count. But not executing
the map() function at all certainly affects its side effects: the side
effects that should take place never do. I'm not sure that's something
to be 'fixed'; it's a legitimate question.
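A minimal plain-Python sketch of that point (a toy, not Spark's actual implementation): if count() can answer without running the mapped function, that function's side effects simply never happen.

```python
class LazySeq:
    """Toy lazy collection: map() records the function but runs nothing."""

    def __init__(self, data, fn=None):
        self._data = data
        self._fn = fn

    def map(self, fn):
        # Transformation: lazily recorded, not executed.
        return LazySeq(self._data, fn)

    def collect(self):
        # An action that materializes: runs fn, including its side effects.
        return [self._fn(x) if self._fn else x for x in self._data]

    def count(self):
        # "Shortcut" count: map() cannot change the length, so answer from
        # the source data -- fn (and its side effects) are skipped entirely.
        return len(self._data)

seen = []  # side-effect target
s = LazySeq([1, 2, 3]).map(lambda x: (seen.append(x), x * 2)[1])

print(s.count())    # 3, yet `seen` is still [] -- the side effects never ran
print(s.collect())  # [2, 4, 6]; only now does `seen` become [1, 2, 3]
```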

You can persist an RDD if you do not want to compute it twice.
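In Spark that is rdd.persist() (or rdd.cache()); the toy sketch below (plain Python, with hypothetical names, not Spark's API) just shows the effect being described: without persisting, every action re-runs the whole lineage.

```python
calls = []  # tracks how many times the "expensive" work actually runs

def expensive(x):
    calls.append(x)  # stand-in for real per-element work
    return x * x

class Lineage:
    """Toy stand-in for an RDD lineage with optional persistence."""

    def __init__(self, data):
        self._data = data
        self._cache = None

    def persist(self):
        # Compute once and keep the result, like RDD.persist()/cache().
        self._cache = [expensive(x) for x in self._data]
        return self

    def compute(self):
        # An action: recomputes unless a persisted copy exists.
        if self._cache is not None:
            return self._cache
        return [expensive(x) for x in self._data]

p = Lineage([1, 2, 3])
p.compute()
p.compute()
print(len(calls))  # 6 -- the work ran twice

calls.clear()
p.persist()
p.compute()
p.compute()
print(len(calls))  # 3 -- computed once, then served from the cache
```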

On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll <jimfcarroll@gmail.com> wrote:
> Hi Sean,
>
> Thanks for the response.
>
> I can't imagine a case (though my imagination may be somewhat limited) where
> even map side effects could change the number of elements in the resulting
> RDD.
>
> I guess "count" wouldn't officially be an 'action' if it were implemented
> this way. At least it wouldn't ALWAYS be one.
>
> My example was contrived. We're passing RDDs to functions. If that RDD is an
> instance of my class, then its count() may take a shortcut. If I
> map/zip/zipWithIndex/mapPartitions/etc. first, then I'm stuck with a call that
> literally takes 100s to 1000s of times longer (seconds vs. hours on some of
> our datasets). And since my custom RDDs are immutable, they cache the count()
> result, so a second invocation costs only a method call's overhead.
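The cached-count pattern described here can be sketched like this (plain Python; CountingRDD and its internals are hypothetical illustrations, not Spark's API): because the collection is immutable, the count is computed once and memoized, so every later count() costs only a method call.

```python
class CountingRDD:
    """Hypothetical immutable partitioned collection with a memoized count."""

    def __init__(self, partitions):
        self._partitions = partitions  # never mutated after construction
        self._count = None             # memoized on first use

    def count(self):
        if self._count is None:
            # First call pays the full (potentially expensive) cost;
            # immutability makes it safe to reuse the answer afterward.
            self._count = sum(len(p) for p in self._partitions)
        return self._count

rdd = CountingRDD([[1, 2], [3, 4, 5]])
print(rdd.count())  # 5 -- computed
print(rdd.count())  # 5 -- returned from the memoized field
```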
>
> I could fix this in Spark if there's any interest in that change. Otherwise
> I'll need to override more RDD methods for my own purposes (like all of the
> transformations). Of course, that will be more difficult because those
> intermediate classes (like MappedRDD) are private, so I can't extend them.
>
> Jim
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11302.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>


