spark-user mailing list archives

From Juan Rodríguez Hortalá
Subject Re: Implementing a spark version of Haskell's partition
Date Wed, 17 Dec 2014 18:07:05 GMT
Hi Andy, thanks for your response. I had already thought about filtering
twice; that is what I meant by "that would be equivalent to applying
filter twice". But I was wondering whether I could do it in a single pass, so
that it could later be generalized to an arbitrary number of classes. I would
also like to generate separate RDDs instead of partitions of a single
RDD, so that I could use RDD methods like stats() on the fragments. But I think
there is currently no RDD method that returns more than one RDD for a
single input RDD, so maybe there is some design limitation in Spark that
prevents this?
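
To make the question concrete, here is a minimal sketch of the "filter once per class" approach generalized to n classes. It is not a true single pass: each class still gets its own filter, and caching the tagged RDD only avoids recomputing the input when the cached data fits. The names SplitRdd and splitByClass are hypothetical helpers, not part of the Spark API, and the local master and sample data are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit RDD conversions, needed on older Spark releases
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object SplitRdd {
  // Hypothetical helper (not a Spark method): returns one RDD per class label.
  def splitByClass[T: ClassTag, K](rdd: RDD[T], classes: Seq[K])(classify: T => K): Map[K, RDD[T]] = {
    // Tag every element with its class and cache the result, so the
    // per-class filters below can reuse it instead of recomputing the input.
    val tagged = rdd.map(x => (classify(x), x)).cache()
    // One filter per class: n lazily evaluated RDDs, not a single pass.
    classes.map(k => k -> tagged.filter(_._1 == k).map(_._2)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to try the sketch out.
    val conf = new SparkConf().setAppName("split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 10).map(_.toDouble)
      // Two classes here (even / odd), but any Seq of labels works.
      val parts = splitByClass(numbers, Seq(true, false))(_ % 2 == 0)
      // Each fragment is a regular RDD, so methods like stats() are available.
      println(parts(true).stats())
      println(parts(false).stats())
    } finally {
      sc.stop()
    }
  }
}
```

With the predicate and its complement as the two classes, this reduces to the two-filter version Andy describes, with the cache as the only difference.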

Again, thanks for your answer.


On 17/12/2014 18:15, "andy petrella" wrote:

> yo,
> First, here is the scala version:
> >Boolean):(Repr,Repr)
> Second: an RDD is distributed, so what you'll have to do is either partition each
> partition each partition (:-D) or create two RDDs by filtering twice →
> hence tasks will be scheduled distinctly, and data read twice. Choose
> what's best for you!
> hth,
> andy
> On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá wrote:
>> Hi all,
>> I would like to be able to split an RDD into two pieces according to a
>> predicate. That would be equivalent to applying filter twice, with the
>> predicate and its complement, which is also similar to Haskell's partition
>> list function (
>> Is there currently any way to do this in Spark? Or maybe someone has a
>> suggestion about how to implement this by modifying the Spark source? I think
>> this is valuable because sometimes I need to split an RDD into several groups
>> that are too big to fit in the memory of a single thread, so pair RDDs are
>> not a solution for those cases. A generalization of Haskell's
>> partition to n parts would do the job.
>> Thanks a lot for your help.
>> Greetings,
>> Juan Rodriguez
