spark-user mailing list archives

From Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Subject Re: Implementing a spark version of Haskell's partition
Date Wed, 17 Dec 2014 18:07:05 GMT
Hi Andy, thanks for your response. I had already thought about filtering
twice; that is what I meant by "that would be equivalent to applying
filter twice". But I was wondering whether I could do it in a single pass,
so that it could later be generalized to an arbitrary number of classes. I
would also like to generate separate RDDs instead of partitions of a single
RDD, so that I could use RDD methods like stats() on the fragments. But I
think there is currently no RDD method that returns more than one RDD for a
single input RDD, so maybe some design limitation in Spark prevents this?
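
A minimal sketch of the filter-based split, generalized to n classes, in
case it helps frame the question. The names splitBy, keys and classify are
hypothetical and not part of the Spark API. It assumes the input fits in the
cluster's cache; it caches the input so the source is not re-read for every
class, but it still scans the cached data once per class, so it is not a
true single pass.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit RDD[Double] functions (stats) on older Spark versions
import org.apache.spark.rdd.RDD

object SplitSketch {
  // Hypothetical helper (not a Spark method): builds one filtered RDD per class key.
  def splitBy[T, K](rdd: RDD[T], keys: Seq[K])(classify: T => K): Map[K, RDD[T]] = {
    rdd.cache() // mark the input for caching so each filter can reuse it
    keys.map(k => k -> rdd.filter(x => classify(x) == k)).toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 100).map(_.toDouble)
      val parts = splitBy(data, Seq("low", "mid", "high")) { x =>
        if (x <= 10) "low" else if (x <= 50) "mid" else "high"
      }
      // Each fragment is an ordinary RDD, so RDD methods such as stats() apply.
      parts.foreach { case (k, r) => println(s"$k: ${r.stats()}") }
    } finally sc.stop()
  }
}
```

With the two-key case (a predicate and its complement) this reduces to
filtering twice.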

Again, thanks for your answer.

Greetings,

Juan
On 17/12/2014 18:15, "andy petrella" <andy.petrella@gmail.com> wrote:

> yo,
>
> First, here is the scala version:
> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
>
> Second: an RDD is distributed, so what you'll have to do is either
> partition each partition (:-D) or create two RDDs by filtering twice;
> hence, with filtering, tasks will be scheduled separately and the data
> read twice. Choose what's best for you!
>
> hth,
> andy
>
>
> On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <
> juan.rodriguez.hortala@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to be able to split an RDD into two pieces according to a
>> predicate. That would be equivalent to applying filter twice, with the
>> predicate and its complement, which is also similar to Haskell's partition
>> list function (
>> http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
>> Is there currently any way to do this in Spark? Or maybe someone has a
>> suggestion about how to implement this by modifying the Spark source. I think
>> this is valuable because sometimes I need to split an RDD into several groups
>> that are too big to fit in the memory of a single thread, so pair RDDs are
>> not a solution for those cases. A generalization of Haskell's partition
>> to n parts would do the job.
>>
>> Thanks a lot for your help.
>>
>> Greetings,
>>
>> Juan Rodriguez
>>
>
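
The "partition each partition" option quoted above can be illustrated
without Spark. The sketch below uses plain Scala collections, with the outer
Seq standing in for an RDD's partitions; that stand-in, and the names
partitionEach and classifyEach, are assumptions made for illustration only.
It shows the Haskell-style two-way split applied per chunk, and an n-way
variant that groups each chunk by a class key in one traversal.

```scala
object PartitionSketch {
  // Two-way split applied independently to each chunk.
  // The outer Seq simulates an RDD's partitions (plain collections, no Spark).
  def partitionEach[T](chunks: Seq[Seq[T]])(p: T => Boolean): Seq[(Seq[T], Seq[T])] =
    chunks.map(_.partition(p))

  // n-way generalization: group the elements of each chunk by a class key.
  def classifyEach[T, K](chunks: Seq[Seq[T]])(classify: T => K): Seq[Map[K, Seq[T]]] =
    chunks.map(_.groupBy(classify))

  def main(args: Array[String]): Unit = {
    val chunks = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7))
    println(partitionEach(chunks)(_ % 2 == 0)) // evens and odds, per chunk
    println(classifyEach(chunks)(_ % 3))       // three classes, per chunk
  }
}
```

Note that the result is still one collection whose elements carry both
halves, which matches the limitation raised in the reply: the fragments are
pieces inside each partition, not separate RDDs.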
