spark-user mailing list archives

From andy petrella <andy.petre...@gmail.com>
Subject Re: Implementing a spark version of Haskell's partition
Date Wed, 17 Dec 2014 17:15:40 GMT
yo,

First, here is the scala version:
http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
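For reference, the standard-library `partition` splits a local collection in a single traversal according to a predicate. A minimal local sketch:

```scala
object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2, 3, 4, 5, 6)
    // partition returns (elements satisfying p, elements failing p) in one pass
    val (evens, odds) = xs.partition(_ % 2 == 0)
    println(evens) // List(2, 4, 6)
    println(odds)  // List(1, 3, 5)
  }
}
```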

Second: an RDD is distributed, so what you'll have to do is partition each
partition (:-D), or create two RDDs by filtering twice → hence the tasks will
be scheduled distinctly, and the data read twice. Choose what's best for you!
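The two-filter approach can be sketched with plain Scala collections standing in for an RDD (the `split` helper and the sample data here are illustrative, not part of any Spark API). On a real RDD the same shape applies with `rdd.filter(p)` and `rdd.filter(x => !p(x))`; each filter is lazy, so each resulting RDD triggers its own scan of the input unless you cache it first:

```scala
object TwoFilterSplit {
  // Hypothetical helper: split a dataset into (matches, non-matches)
  // using two complementary filters, i.e. two passes over the data.
  // With a Spark RDD the analogous code would be:
  //   val hits   = rdd.filter(p)
  //   val misses = rdd.filter(x => !p(x))
  // and rdd.cache() would avoid reading the source twice.
  def split[A](data: Seq[A])(p: A => Boolean): (Seq[A], Seq[A]) =
    (data.filter(p), data.filter(x => !p(x)))

  def main(args: Array[String]): Unit = {
    val (big, small) = split(Seq(1, 7, 3, 9, 2))(_ > 5)
    println(big)   // List(7, 9)
    println(small) // List(1, 3, 2)
  }
}
```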

hth,
andy


On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <
juan.rodriguez.hortala@gmail.com> wrote:

> Hi all,
>
> I would like to be able to split a RDD in two pieces according to a
> predicate. That would be equivalent to applying filter twice, with the
> predicate and its complement, which is also similar to Haskell's partition
> list function (
> http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
> Is there currently any way to do this in Spark? Or maybe someone has a
> suggestion about how to implement this by modifying the Spark source. I think
> this is valuable because sometimes I need to split an RDD into several groups
> that are too big to fit in the memory of a single thread, so pair RDDs are
> not a solution for those cases. A generalization to n parts of Haskell's
> partition would do the job.
>
> Thanks a lot for your help.
>
> Greetings,
>
> Juan Rodriguez
>
