spark-dev mailing list archives

From Erik Erlandson <...@redhat.com>
Subject Re: RFC: Supporting the Scala drop Method for Spark RDDs
Date Mon, 21 Jul 2014 15:53:14 GMT


----- Original Message -----
> I too would like this feature. Erik's post makes sense. However, shouldn't
> the RDD also repartition itself after drop to effectively make use of
> cluster resources?


My thinking is that in most use cases (*), one is dropping a small number of rows that all
sit in the first partition, so repartitioning would not be worth the cost. The first
partition would be passed through mostly intact, and the remaining partitions would be
completely unchanged.

(*) or at least most use cases that I've considered.
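
To make the partition argument concrete, here is a minimal sketch in plain Python. It is
not Spark code and not the implementation in the PR: it models an RDD as a list of
partitions, and the name drop_from_partitions is made up for this illustration. It
assumes the dropped rows are taken from the partitions in order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_from_partitions(partitions: Sequence[Sequence[T]], n: int) -> List[List[T]]:
    """Drop the first n elements, walking the partitions in order.

    Hypothetical helper for illustration only: each inner sequence stands in
    for one partition of an RDD.
    """
    remaining = max(n, 0)  # treat a negative n as "drop nothing"
    result: List[List[T]] = []
    for part in partitions:
        if remaining == 0:
            # Nothing left to drop: this partition passes through unchanged.
            result.append(list(part))
        else:
            # Drop as many rows as this partition can supply, then carry on.
            k = min(remaining, len(part))
            result.append(list(part[k:]))
            remaining -= k
    return result


if __name__ == "__main__":
    partitions = [["header", "a", "b"], ["c", "d", "e"], ["f", "g"]]
    dropped = drop_from_partitions(partitions, 1)
    # Compare partition sizes before and after dropping one row.
    print([len(p) for p in partitions])
    print([len(p) for p in dropped])
```

With a small n, only the first partition shrinks, so the partition sizes stay close to
what they were before the drop. That is why a repartition looks hard to justify in this
case.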


> On Jul 21, 2014 8:58 PM, "Andrew Ash [via Apache Spark Developers List]" <
> ml-node+s1001551n7434h99@n3.nabble.com> wrote:
> 
> > Personally I'd find the method useful -- I've often had a .csv file with a
> > header row that I want to drop, so I filter it out, which touches all
> > partitions anyway.  I don't have any comments on the implementation quite
> > yet though.
> >
> >
> > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <[hidden email]> wrote:
> >
> > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > SPARK-2315:
> > > https://issues.apache.org/jira/browse/SPARK-2315
> > >
> > > Supporting the drop method would make some operations convenient, however
> > > it forces computation of >= 1 partition of the parent RDD, and so it would
> > > behave like a "partial action" that returns an RDD as the result.
> > >
> > > I wrote up a discussion of these trade-offs here:
> > >
> > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > >
