spark-dev mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: Adding RDD function to segment an RDD (like substring)
Date Tue, 09 Dec 2014 21:16:51 GMT
`zipWithIndex` is both compute-intensive and breaks Spark's
"transformations are lazy" model, so it is probably not appropriate to add
this to the public RDD API.  If `zipWithIndex` weren't already what I
consider to be broken, I'd be much friendlier to building something more on
top of it, but I really don't like `zipWithIndex` or this `sample` idea of
yours as they stand.  Aside from those concerns, the name `sample` doesn't
really work since it is too easy to assume that it means some kind of
random sampling.

None of that is to say that you can't make effective use of `zipWithIndex`
and your `sample` in your own code; but I'm not a fan of extending the
Spark public API in this way.
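[For reference, a minimal local-collection sketch of the semantics under discussion: zip each element with its index, keep the half-open index range, then drop the indices. This is plain Scala with no Spark dependency, and the name `segment` is hypothetical; on an actual RDD, the analogous `zipWithIndex` step is what triggers the eager job raised as a concern above.]

```scala
// Local-collection analogue of the proposed "substring of an RDD":
// pair each element with its index, keep indices in [startIdx, endIdx),
// and strip the indices afterwards.
def segment[T](xs: Seq[T], startIdx: Int, endIdx: Int): Seq[T] =
  xs.zipWithIndex
    .filter { case (_, idx) => idx >= startIdx && idx < endIdx }
    .map(_._1)

println(segment(Seq("a", "b", "c", "d", "e"), 1, 4)) // List(b, c, d)
```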

On Tue, Dec 9, 2014 at 12:38 PM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com>
wrote:

> Hi all – a utility I’ve found useful several times when working with RDDs
> is being able to reason about segments of an RDD.
>
> For example, if I have two large RDDs and I want to combine them in a way
> that would be intractable in terms of memory or disk storage (e.g. A
> cartesian) but a piece-wise approach is tractable, there are not many
> methods that enable this.
>
> Existing relevant methods are:
> takeSample – which allows me to take a number of random values from an RDD
> but does not let me process the entire RDD
> randomSplit – which segments the RDD into an array of smaller RDDs; with
> this, however, I’ve seen numerous memory issues during execution on larger
> datasets or when splitting into many RDDs
> foreach or collect – which require the RDD to fit into memory, which is
> not possible for larger datasets
>
> What I am proposing is the addition of a simple method on RDD that would
> behave like the substring operation on String:
>
> For an RDD[T]:
>
> def sample(startIdx: Int, endIdx: Int): RDD[T] = {
>   val zipped = this.zipWithIndex()
>   val sampled = zipped.filter { case (_, idx) => idx >= startIdx && idx < endIdx }
>   sampled.map(_._1)
> }
>
> I would appreciate any feedback – thank you!
>
