spark-user mailing list archives

From Fabrizio Milo aka misto <mistob...@gmail.com>
Subject Re: RDD.subtract doesn't work
Date Fri, 13 Sep 2013 17:56:45 GMT
@Jason

I find it interesting, but I am not sure I understand it completely.
I am assuming that the objective is to partition a dataset into two
subsets, where the elements of each subset are randomly selected from
the original dataset.

If I read the code correctly, what it does is:

For each partition, create a random generator with its own seed.
For each element inside a partition, generate a random number.
For each (value, randomNumber) pair, apply two filters: one for the
elements whose randomNumber is less than the split value, and the
other for the elements whose randomNumber is greater than the split
value.

And because the random generator is uniform, if we provide p: Double = 0.3,
roughly 30% of the random numbers will fall under 0.3, so roughly 30% of
the elements end up in the first subset.
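
To check my reading, here is a minimal sketch of that logic in plain
Python rather than Spark. It is only an illustration under my own
assumptions: the partitions are simulated as lists, the names
tag_partition, random_split and base_seed are made up here, and seeding
each partition with base_seed + index is a guess, not necessarily what
Jason's code does.

```python
import random

def tag_partition(index, elements, base_seed):
    """Pair each element of one partition with a uniform number in [0, 1)."""
    # One generator per partition; the seeding scheme is an assumption.
    rng = random.Random(base_seed + index)
    return [(value, rng.random()) for value in elements]

def random_split(partitions, p, base_seed=42):
    """Split the simulated partitions into two subsets using threshold p."""
    tagged = [
        pair
        for index, elements in enumerate(partitions)
        for pair in tag_partition(index, elements, base_seed)
    ]
    # Two filters over the same (value, randomNumber) pairs.
    below = [value for value, r in tagged if r < p]
    above = [value for value, r in tagged if r >= p]
    return below, above

# Four simulated partitions holding 10,000 elements in total.
partitions = [list(range(i * 2500, (i + 1) * 2500)) for i in range(4)]
below, above = random_split(partitions, p=0.3)
print(len(below) / (len(below) + len(above)))  # should be close to 0.3
```

Both filters have to see the same random number for a given element,
otherwise an element could land in both subsets or in neither. In this
sketch that holds because the pairs are materialized once; with a fixed
seed per partition, regenerating them would also give the same numbers.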

Is this done on one node, or does each node handle only the mapping
for the partitions it received? I guess/hope the latter, but I want
to make sure.

Thank you
--------------------------
Luck favors the prepared mind. (Pasteur)
