spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: RDD.count
Date Sat, 28 Mar 2015 23:34:45 GMT
I think the worry here is that people often use count() to force execution,
and when that is coupled with transformations that have side effects, it is
no longer safe to skip running them.
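
To make the concern concrete, here is a minimal sketch in plain Python. It is a
toy model, not Spark's API: the LazyDataset class, shortcut_count, and record
are all hypothetical names made up for illustration. It only shows that a count
which skips a map() returns the same number but leaves the map's side effects
undone.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, source, fn=None):
        self.source = source  # a plain list, or a parent LazyDataset
        self.fn = fn          # per-element function, applied only on evaluation

    def map(self, fn):
        # Lazy: only records the function; nothing runs yet.
        return LazyDataset(self, fn)

    def _iterate(self):
        if self.fn is None:
            yield from self.source
        else:
            for x in self.source._iterate():
                yield self.fn(x)

    def count(self):
        # Evaluates every element, so side effects inside fn take place.
        return sum(1 for _ in self._iterate())

    def shortcut_count(self):
        # Assumed optimization: map never changes the number of elements,
        # so walk back to the root and use its length without running fn.
        node = self
        while node.fn is not None:
            node = node.source
        return len(node.source)


written = []

def record(x):
    written.append(x)  # side effect, e.g. writing to an external store
    return x * 2

ds = LazyDataset([1, 2, 3]).map(record)

n_fast = ds.shortcut_count()
effects_after_shortcut = list(written)  # nothing was recorded

n_full = ds.count()
effects_after_full = list(written)      # every element was recorded
```

Both calls agree on the count; they differ only in whether record() ever ran,
which is the behavior callers relying on count() to force execution would lose.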

However, maybe we can add a new lazy val .size that doesn't require
recomputation.
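
One possible reading of that idea, sketched in plain Python rather than Spark:
count() keeps its current meaning and always evaluates, while a separate
memoized size is computed at most once. The SizedDataset class and its
evaluations counter are hypothetical and exist only for this illustration;
functools.cached_property plays the role of a Scala lazy val here.

```python
from functools import cached_property


class SizedDataset:
    """Toy dataset (hypothetical) with an always-evaluating count and a memoized size."""

    def __init__(self, elements, fn):
        self._elements = list(elements)
        self._fn = fn
        self.evaluations = 0  # how many times the full pipeline has run

    def count(self):
        # Unchanged semantics: every call runs fn over every element.
        self.evaluations += 1
        return sum(1 for _ in map(self._fn, self._elements))

    @cached_property
    def size(self):
        # Computed on first access, then stored on the instance.
        return self.count()


ds = SizedDataset(range(5), lambda x: x + 1)

first = ds.size
second = ds.size                  # served from the cache, no re-evaluation
runs_after_size = ds.evaluations

ds.count()                        # explicit count still forces evaluation
runs_after_count = ds.evaluations
```

This keeps existing count() callers unaffected; whether size should also skip
transformations on first access is a separate design question that this sketch
does not settle.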


On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> I definitely see the value in this.  However, I think at this point it
> would be an incompatible behavioral change.  People often use count in
> Spark to exercise their DAG.  Omitting processing steps that were
> previously included would likely mislead many users into thinking their
> pipeline was running faster.
>
> It's possible there might be room for something like a new smartCount API
> or a new argument to count that allows it to avoid unnecessary
> transformations.
>
> -Sandy
>
> On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen <sowen@cloudera.com> wrote:
>
> > No, I'm not saying side effects change the count. But not executing
> > the map() function at all certainly has an effect on the side effects
> > of that function: the side effects which should take place never do. I
> > am not sure that is something to be 'fixed'; it's a legitimate
> > question.
> >
> > You can persist an RDD if you do not want to compute it twice.
> >
> > On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll <jimfcarroll@gmail.com>
> > wrote:
> > > Hi Sean,
> > >
> > > Thanks for the response.
> > >
> > > I can't imagine a case (though my imagination may be somewhat limited)
> > > where even map side effects could change the number of elements in the
> > > resulting map.
> > >
> > > I guess "count" wouldn't officially be an 'action' if it were
> > > implemented this way. At least it wouldn't ALWAYS be one.
> > >
> > > My example was contrived. We're passing RDDs to functions. If that RDD
> > > is an instance of my class, then its count() may take a shortcut. If I
> > > map/zip/zipWithIndex/mapPartitions/etc. first, then I'm stuck with a
> > > call that literally takes 100s to 1000s of times longer (seconds vs.
> > > hours on some of our datasets). Since my custom RDDs are immutable, they
> > > cache the count call, so a second invocation costs only a method call's
> > > overhead.
> > >
> > > I could fix this in Spark if there's any interest in that change.
> > > Otherwise I'll need to overload more RDD methods for my own purposes
> > > (like all of the transformations). Of course, that will be more
> > > difficult because those intermediate classes (like MappedRDD) are
> > > private, so I can't extend them.
> > >
> > > Jim
> > >
> > > --
> > > View this message in context:
> > > http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11302.html
> > > Sent from the Apache Spark Developers List mailing list archive at
> > > Nabble.com.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > For additional commands, e-mail: dev-help@spark.apache.org
> > >
