spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Hamstra <m...@clearstorydata.com>
Subject Re: RFC: Supporting the Scala drop Method for Spark RDDs
Date Mon, 21 Jul 2014 18:06:34 GMT
Rather than embrace non-lazy transformations and add more of them, I'd
rather we 1) try to fully characterize the needs that are driving their
creation/usage; and 2) design and implement new Spark abstractions that
will allow us to meet those needs and eliminate existing non-lazy
transformation.  They really mess up things like creation of asynchronous
FutureActions, job cancellation and accounting of job resource usage, etc.,
so I'd rather we seek a way out of the existing hole rather than make it
deeper.


On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson <eje@redhat.com> wrote:

>
>
> ----- Original Message -----
> > Sure, drop() would be useful, but breaking the "transformations are lazy;
> > only actions launch jobs" model is abhorrent -- which is not to say that
> we
> > haven't already broken that model for useful operations (cf.
> > RangePartitioner, which is used for sorted RDDs), but rather that each
> such
> > exception to the model is a significant source of pain that can be hard
> to
> > work with or work around.
>
> A thought that comes to my mind here is that there are in fact already two
> categories of transform: ones that are truly lazy, and ones that are not.
>  A possible option is to embrace that, and commit to documenting the two
> categories as such, with an obvious bias towards favoring lazy transforms
> (to paraphrase Churchill, we're down to haggling over the price).
>
>
> >
> > I really wouldn't like to see another such model-breaking transformation
> > added to the API.  On the other hand, being able to write transformations
> > with dependencies on these kind of "internal" jobs is sometimes very
> > useful, so a significant reworking of Spark's Dependency model that would
> > allow for lazily running such internal jobs and making the results
> > available to subsequent stages may be something worth pursuing.
>
>
> This seems like a very interesting angle.   I don't have much feel for
> what a solution would look like, but it sounds as if it would involve
> caching all operations embodied by RDD transform method code for
> provisional execution.  I believe that these levels of invocation are
> currently executed in the master, not executor nodes.
>
>
> >
> >
> > On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <andrew@andrewash.com>
> wrote:
> >
> > > Personally I'd find the method useful -- I've often had a .csv file
> with a
> > > header row that I want to drop so filter it out, which touches all
> > > partitions anyway.  I don't have any comments on the implementation
> quite
> > > yet though.
> > >
> > >
> > > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <eje@redhat.com>
> wrote:
> > >
> > > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > > SPARK-2315:
> > > > https://issues.apache.org/jira/browse/SPARK-2315
> > > >
> > > > Supporting the drop method would make some operations convenient,
> however
> > > > it forces computation of >= 1 partition of the parent RDD, and so it
> > > would
> > > > behave like a "partial action" that returns an RDD as the result.
> > > >
> > > > I wrote up a discussion of these trade-offs here:
> > > >
> > > >
> > >
> http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message