spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Erlandson <...@redhat.com>
Subject Re: RFC: Supporting the Scala drop Method for Spark RDDs
Date Mon, 21 Jul 2014 17:24:14 GMT


----- Original Message -----
> Sure, drop() would be useful, but breaking the "transformations are lazy;
> only actions launch jobs" model is abhorrent -- which is not to say that we
> haven't already broken that model for useful operations (cf.
> RangePartitioner, which is used for sorted RDDs), but rather that each such
> exception to the model is a significant source of pain that can be hard to
> work with or work around.

A thought that comes to my mind here is that there are in fact already two categories of transform:
ones that are truly lazy, and ones that are not.  A possible option is to embrace that, and
commit to documenting the two categories as such, with an obvious bias towards favoring lazy
transforms (to paraphrase Churchill, we're down to haggling over the price).
 

> 
> I really wouldn't like to see another such model-breaking transformation
> added to the API.  On the other hand, being able to write transformations
> with dependencies on these kind of "internal" jobs is sometimes very
> useful, so a significant reworking of Spark's Dependency model that would
> allow for lazily running such internal jobs and making the results
> available to subsequent stages may be something worth pursuing.


This seems like a very interesting angle.   I don't have much feel for what a solution would
look like, but it sounds as if it would involve caching all operations embodied by RDD transform
method code for provisional execution.  I believe that these levels of invocation are currently
executed in the master, not executor nodes.


> 
> 
> On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <andrew@andrewash.com> wrote:
> 
> > Personally I'd find the method useful -- I've often had a .csv file with a
> > header row that I want to drop so filter it out, which touches all
> > partitions anyway.  I don't have any comments on the implementation quite
> > yet though.
> >
> >
> > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <eje@redhat.com> wrote:
> >
> > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > SPARK-2315:
> > > https://issues.apache.org/jira/browse/SPARK-2315
> > >
> > > Supporting the drop method would make some operations convenient, however
> > > it forces computation of >= 1 partition of the parent RDD, and so it
> > would
> > > behave like a "partial action" that returns an RDD as the result.
> > >
> > > I wrote up a discussion of these trade-offs here:
> > >
> > >
> > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > >
> >
> 

Mime
View raw message