spark-user mailing list archives

From Juan Rodríguez Hortalá <>
Subject Re: Implementing a spark version of Haskell's partition
Date Thu, 18 Dec 2014 10:30:44 GMT
Hi Andy,

Thanks again for your thoughts on this. I haven't found much information
about the internals of Spark, so I find these kinds of explanations of its
low-level mechanisms very useful and interesting. It's also nice to know
that the two-pass approach is a viable solution.
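For reference, the two-pass approach discussed in this thread can be sketched roughly as follows (the helper name `partitionRDD` is just for illustration, not a Spark API; caching avoids recomputing the parent lineage on the second pass):

```scala
import org.apache.spark.rdd.RDD

// Two-pass "partition": run filter twice, once with the predicate
// and once with its complement. Caching the input means the data is
// only materialized once even though two jobs are scheduled.
def partitionRDD[A](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) = {
  val cached = rdd.cache()
  (cached.filter(p), cached.filter(a => !p(a)))
}
```

Usage would look like `val (evens, odds) = partitionRDD(numbers)(_ % 2 == 0)`, yielding two independent RDDs on which methods like stats() can be called.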



2014-12-18 11:10 GMT+01:00 andy petrella <>:
> NP man,
> The thing is that since you're in a distributed environment, it'd be
> cumbersome to do that. Remember that Spark basically works on
> blocks/partitions; they are the unit of distribution and parallelization.
> That means that actions have to be run against them **after having been
> scheduled on the cluster**.
> The latter point is the most important one: it means that the RDDs aren't
> "really" created on the driver; the collection is created/transformed/... on
> the partitions.
> As a consequence, you cannot create such a representation of the
> distributed collection on the driver, since you haven't seen the data yet.
> That being said, you can only prepare/define on the driver some
> computations that will segregate the data by applying a filter on the nodes.
> If you want to keep the RDD operators as they are, then yes, you'll need to
> pass over the distributed data twice.
> Another option, using mapPartitions for instance, would be to create an
> RDD[(Seq[A], Seq[A])]; however, it's going to be tricky because you might
> have to repartition, otherwise OOMs might blow up in your face :-D.
> I wouldn't pick that one!
> A final note: looping over the data is not that much of a problem
> (especially if you can cache it), and in fact it's way better to keep the
> advantage of the resilience etc. that comes with Spark.
> my2c
> andy
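The mapPartitions idea mentioned above could be sketched as follows (a sketch only, not from Spark's API; as noted, each partition's contents must fit in memory as materialized Seqs, which is exactly the OOM risk being warned about):

```scala
import org.apache.spark.rdd.RDD

// Single-pass variant: split each partition locally into the elements
// satisfying p and the rest. The result is one (matching, non-matching)
// pair of Seqs per partition, not two separate RDDs.
def partitionWithin[A](rdd: RDD[A])(p: A => Boolean): RDD[(Seq[A], Seq[A])] =
  rdd.mapPartitions { it =>
    val (yes, no) = it.toSeq.partition(p)
    Iterator((yes, no))
  }
```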
> On Wed Dec 17 2014 at 7:07:05 PM Juan Rodríguez Hortalá <
>> wrote:
>> Hi Andy, thanks for your response. I already thought about filtering
>> twice; that was what I meant with "that would be equivalent to applying
>> filter twice". But I was wondering if I could do it in a single pass, so
>> that it could later be generalized to an arbitrary number of classes. I would
>> also like to be able to generate RDDs instead of partitions of a single
>> RDD, so I could use RDD methods like stats() on the fragments. But I think
>> there is currently no RDD method that returns more than one RDD for a
>> single input RDD, so maybe there is some design limitation in Spark that
>> prevents this?
>> Again, thanks for your answer.
>> Greetings,
>> Juan
>> El 17/12/2014 18:15, "andy petrella" <> escribió:
>>> yo,
>>> First, here is the Scala version:
>>> def partition(p: A => Boolean): (Repr, Repr)
>>> Second: RDD is distributed, so what you'll have to do is either partition
>>> each partition (:-D) or create two RDDs by filtering
>>> twice → hence tasks will be scheduled distinctly, and data read twice.
>>> Choose what's best for you!
>>> hth,
>>> andy
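For comparison, the local Scala collections version of partition referenced above behaves just like Haskell's, splitting in a single pass into matching and non-matching elements:

```scala
// Scala's standard-library partition on a local collection:
// one traversal, two result collections.
val (evens, odds) = List(1, 2, 3, 4, 5).partition(_ % 2 == 0)
// evens == List(2, 4), odds == List(1, 3, 5)
```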
>>> On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <
>>>> wrote:
>>>> Hi all,
>>>> I would like to be able to split an RDD in two pieces according to a
>>>> predicate. That would be equivalent to applying filter twice, with the
>>>> predicate and its complement, which is also similar to Haskell's partition
>>>> list function.
>>>> Is there currently any way to do this in Spark? Or maybe someone has a
>>>> suggestion about how to implement this by modifying the Spark source. I think
>>>> this is valuable because sometimes I need to split an RDD into several groups
>>>> that are too big to fit in the memory of a single thread, so pair RDDs are
>>>> not a solution for those cases. A generalization to n parts of Haskell's
>>>> partition would do the job.
>>>> Thanks a lot for your help.
>>>> Greetings,
>>>> Juan Rodriguez
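The n-way generalization asked about above could be sketched with one filter per class (the helper `multiPartition` and the discriminator function are hypothetical, not part of Spark; this schedules n separate filter jobs, so caching the input matters):

```scala
import org.apache.spark.rdd.RDD

// Generalized "partition" into n RDDs: output i holds the elements
// for which the discriminator returns i. Each output RDD triggers its
// own pass over the (cached) input when an action is run on it.
def multiPartition[A](rdd: RDD[A], n: Int)(discriminator: A => Int): Seq[RDD[A]] = {
  val cached = rdd.cache()
  (0 until n).map(i => cached.filter(a => discriminator(a) == i))
}
```

For example, `multiPartition(numbers, 3)(_ % 3)` would yield three RDDs, one per residue class.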
