I believe your understanding is correct. And yes, the processing should happen in parallel, with each partition handled independently.


On Fri, Sep 13, 2013 at 10:56 AM, Fabrizio Milo aka misto <mistobaan@gmail.com> wrote:
@Jason

I find it interesting, but I am not sure I understand it completely.
I am assuming that the objective is to partition a dataset into two
subsets, where the elements of each subset are randomly selected from
the original dataset.

If I read the code correctly, what it does is:

For each partition, create a random seed.
For each element inside a partition, generate a random number.
For each (value, randomNumber) pair, apply two filters: one keeps the
elements whose randomNumber is less than the split value, and the
other keeps the elements whose randomNumber is greater than the split
value.

And because the random generator is uniform, if we provide p: Double = 0.3,
roughly 30% of the random numbers will fall below 0.3, so roughly 30% of
the elements should end up in the first subset.
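
To check my reading, here is a minimal sketch of those steps in plain Python. It is not Spark and not the code from the original message: the partitions are simulated as ordinary lists, and the names split_partition, random_split and base_seed are hypothetical, made up only for this illustration. I also assume that a random number exactly equal to p goes to the second subset.

```python
import random


def split_partition(partition, p, seed):
    """Split one partition into two lists using one random draw per element."""
    rng = random.Random(seed)  # one generator per partition, with its own seed
    # Pair every value with a uniform random number in [0, 1).
    tagged = [(value, rng.random()) for value in partition]
    # Two filters over the same (value, randomNumber) pairs.
    below = [value for value, r in tagged if r < p]
    above = [value for value, r in tagged if r >= p]  # assumption: ties go here
    return below, above


def random_split(partitions, p, base_seed=42):
    """Apply split_partition to each partition independently and merge the results."""
    # Each call only looks at its own partition, so in a cluster these calls
    # could run on different nodes in parallel. Here they run sequentially.
    results = [
        split_partition(part, p, base_seed + index)
        for index, part in enumerate(partitions)
    ]
    first = [value for below, _ in results for value in below]
    second = [value for _, above in results for value in above]
    return first, second


if __name__ == "__main__":
    data = list(range(10000))
    partitions = [data[i::4] for i in range(4)]  # four simulated partitions
    first, second = random_split(partitions, p=0.3)
    # The first subset should hold roughly 30% of the elements.
    print(len(first) / len(data))
```

Since nothing in split_partition depends on any other partition, this matches what I hope happens on the cluster: each node only needs the partitions it received.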

Is this done on a single node, or does each node handle only the
mapping for the partitions it received? I guess/hope the latter, but I
want to make sure.

Thank you
--------------------------
Luck favors the prepared mind. (Pasteur)