spark-user mailing list archives

From andy petrella <>
Subject Re: Implementing a spark version of Haskell's partition
Date Wed, 17 Dec 2014 17:15:40 GMT

First, here is the Scala version:

Second: an RDD is distributed, so you'll either have to partition each
partition (:-D) or create two RDDs by filtering twice. In the latter case
the tasks will be scheduled separately and the data will be read twice.
Choose what's best for you!
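
A minimal sketch of the filter-twice option, assuming a plain RDD[T] and a
predicate on T. The helper name splitByPredicate is hypothetical and is not
part of the Spark API. The cache() call is an added assumption meant to
soften the double read; it only helps when the data fits in the available
storage.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): split an RDD into the elements
// that satisfy the predicate and those that do not, by filtering twice.
def splitByPredicate[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
  // Assumption: mark the parent for caching so the second filter can reuse
  // the data computed by the first action instead of re-reading the source.
  val cached = rdd.cache()
  val matching = cached.filter(p)              // elements where p holds
  val nonMatching = cached.filter(x => !p(x))  // elements where p fails
  (matching, nonMatching)
}
```

Both results are ordinary RDDs, so each one is still evaluated lazily and
scheduled as its own set of tasks, for example
splitByPredicate(sc.parallelize(1 to 10))(_ % 2 == 0) with an existing
SparkContext named sc.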


On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <> wrote:

> Hi all,
> I would like to be able to split an RDD into two pieces according to a
> predicate. That would be equivalent to applying filter twice, with the
> predicate and its complement, which is also similar to Haskell's partition
> list function (
> Is there currently any way to do this in Spark? Or maybe someone has a
> suggestion about how to implement this by modifying the Spark source. I think
> this is valuable because sometimes I need to split an RDD into several groups
> that are too big to fit in the memory of a single thread, so pair RDDs are
> not a solution for those cases. A generalization of Haskell's partition to
> n parts would do the job.
> Thanks a lot for your help.
> Greetings,
> Juan Rodriguez
