I find this interesting, but I am not sure I understand it completely.
I am assuming that the objective is to partition a dataset into two
subsets, where the elements of each subset are randomly selected from the
original dataset.
If I read the code correctly, what it does is:
1. For each partition, create a random seed.
2. For each element inside a partition, generate a random number.
3. For each (value, randomNumber) pair, create two filters: one for the
   elements whose random number is less than the split value, and the
   other for the elements whose random number is greater than the
   split value.
And because the random generator is uniform, if we provide p: Double = 0.3,
roughly 30% of the random numbers will fall under 0.3, so roughly 30% of
the elements end up in the first subset.
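
To check my reading, here is a minimal sketch of those steps in plain Scala, with no Spark involved. It is not the actual code under discussion: the names SplitSketch, tagPartition, split and baseSeed are hypothetical, the partitions are simulated with ordinary collections, and seeding each partition with baseSeed plus the partition index is only my assumption about how a per-partition seed could be derived.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical helper: pairs every element of one partition with a
  // uniform random number in [0, 1), using one generator per partition.
  def tagPartition[T](partition: Seq[T], seed: Long): Seq[(T, Double)] = {
    val rng = new Random(seed)
    partition.map(value => (value, rng.nextDouble()))
  }

  // Tags all partitions, then applies the two filters: elements whose
  // random number is below p, and elements whose number is at or above p.
  def split[T](partitions: Seq[Seq[T]], p: Double, baseSeed: Long): (Seq[T], Seq[T]) = {
    val tagged = partitions.zipWithIndex.flatMap { case (part, idx) =>
      // Assumption: the per-partition seed is baseSeed + partition index.
      tagPartition(part, baseSeed + idx)
    }
    val below = tagged.filter(_._2 < p).map(_._1)
    val atOrAbove = tagged.filter(_._2 >= p).map(_._1)
    (below, atOrAbove)
  }

  def main(args: Array[String]): Unit = {
    // Four simulated partitions of 2500 elements each.
    val partitions = (1 to 10000).grouped(2500).map(_.toSeq).toSeq
    val (small, large) = split(partitions, p = 0.3, baseSeed = 42L)
    println(s"below p: ${small.size}, at or above p: ${large.size}")
  }
}
```

In this sketch tagPartition only needs its own partition and its own seed, so nothing in the tagging step depends on data from another partition. Every element lands in exactly one of the two subsets, and the first subset should hold roughly 30% of them.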
Is this done on one node, or does each node handle only the mapping for
the partition it received? I guess/hope the latter, but I want to make
sure.