spark-user mailing list archives

From Jason Lenderman <jslender...@gmail.com>
Subject Re: RDD.subtract doesn't work
Date Sun, 15 Sep 2013 17:56:19 GMT
I believe your understanding is correct. And, yes, the processing should
happen in parallel for each partition.
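To make the per-partition logic concrete, here is a minimal sketch in plain Python rather than Spark code; the function name `split_partition` and the `base_seed` parameter are illustrative, not from the actual implementation:

```python
import random

def split_partition(partition_index, elements, p, base_seed=42):
    """Split one partition's elements into two subsets using a
    per-partition seeded RNG: elements whose draw falls below p go
    to the first subset, the rest to the second."""
    rng = random.Random(base_seed + partition_index)
    below, above = [], []
    for x in elements:
        if rng.random() < p:
            below.append(x)
        else:
            above.append(x)
    return below, above

# Each partition is processed independently with its own seed, so this
# maps cleanly onto running once per partition on each node.
partitions = [list(range(10000)) for _ in range(4)]
splits = [split_partition(i, part, 0.3) for i, part in enumerate(partitions)]
frac = sum(len(b) for b, _ in splits) / sum(map(len, partitions))
# frac will be roughly 0.3, since the draws are uniform on [0, 1)
```

Because each partition only needs its own index and elements, no coordination between nodes is required beyond agreeing on the base seed.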


On Fri, Sep 13, 2013 at 10:56 AM, Fabrizio Milo aka misto <
mistobaan@gmail.com> wrote:

> @Jason
>
> I find it interesting, but I am not sure I understand it completely.
> I am assuming that the objective is to partition a dataset into two
> subsets, where the elements of each subset are randomly selected from
> the original dataset.
>
> If I read the code correctly, what it does is:
>
> For each partition, create a random seed.
> For each element inside a partition, generate a random number.
> For each (value, randomNumber) pair, create two filters: one for the
> elements whose random number is less than the split value, and one
> for the elements whose random number is greater than or equal to it.
>
> And because the random generator is uniform, if we provide p: Double
> = 0.3, roughly 30% of the random numbers will fall below 0.3.
>
> Is this done on one node, or does each node handle only the mapping
> on the partitions it received? I guess/hope the second, but I want to
> make sure.
>
> Thank you
> --------------------------
> Luck favors the prepared mind. (Pasteur)
>
